Break it down


Data is everywhere. 

Nowadays, everyone has to deal with mounds of data, whether they call 
themselves “data analysts” or not. But people who possess a toolbox of data 
analysis skills have a massive edge on everyone else, because they understand 
what to do with all that stuff. They know how to translate raw numbers into 
intelligence that drives real-world action. They know how to break down and 
structure complex problems and data sets to get right to the heart of the problems 
in their business. 
Take it to the max

We all want more of something. 

And we’re always trying to figure out how to get it. If the things we want more of, 
such as profit, money, efficiency, and speed, can be represented numerically, then chances 
are, there’s a tool of data analysis to help us tweak our decision variables, which 
will help us find the solution or optimal point where we get the most of what 
we want. In this chapter, you’ll be using one of those tools and the powerful 
spreadsheet Solver package that implements it. 
Say it ain’t so 

The world can be tricky to explain. 
answer involves an extremely handy analytic tool called Bayes’ rule, which will help 
Numerical belief 


Sometimes, it’s a good idea to make up numbers. 
Analyze like a human 

The real world has more variables than you can handle. 
Err well 
The Top Ten Things (we didn’t cover) 


You’ve come a long way. 

Data analysis is a vast and constantly evolving field, and there’s so much left to 
learn. In this appendix, we’ll go over ten items that there wasn’t enough room to cover 
in this book but should be high on your list of topics to learn about next. 
Install R 



Start R up! 

Behind all that data-crunching power is enormous 
complexity. 

But fortunately, getting R installed and started is something you can accomplish in 
just a few minutes, and this appendix is about to show you how to pull off your R 
install without a hitch. 


Get started with R 


Install Excel analysis tools 

The ToolPak 

Some of the best features of Excel aren’t installed by 
default. 

That’s right, in order to run the optimization from Chapter 3 and the histograms from 
Chapter 9, you need to activate the Solver and the Analysis ToolPak, two extensions 
that are included in Excel by default but not activated without your initiative. 


Install the data analysis tools in Excel 
Is this book for you? 
a great gift for that 
special someone. 


I can’t believe 
they put that in a 
data analysis book 
how to use this book 


Who is this book for? 


If you can answer “yes” to all of these: 




Do you feel like there’s a world of insights buried in your 
data that you’d only be able to access if you had the right 
tools? 

Do you want to learn, understand, and remember how 
to create brilliant graphics, test hypotheses, run a 
regression, or clean up messy data? 



Do you prefer stimulating dinner party conversation to dry, 
dull, academic lectures? 


this book is for you. 


Who should probably back away from this book? 

If you can answer “yes” to any of these: 




Are you a seasoned, brilliant data analyst looking for a 
survey of bleeding edge data topics? 

Have you never loaded and used Microsoft Excel or 
OpenOffice calc? 


Are you afraid to try something different? Would you 
rather have a root canal than mix stripes with plaid? 
Do you believe that a technical book can’t be serious 
if it anthropomorphizes control groups and objective 
functions? 


this book is not for you. 



This book is for anyone with a credit card. 
We know what you’re thinking 

“How can this be a serious data analysis book?” 
“What’s with all the graphics?” 
“Can I actually learn it this way?” 

We know what your brain is thinking 

Your brain craves novelty. It’s always searching, scanning, waiting for something 
unusual. It was built that way, and it helps you stay alive. 

So what does your brain do with all the routine, ordinary, normal things 
you encounter? Everything it can to stop them from interfering with the 
brain’s real job — recording things that matter. It doesn’t bother saving the 
boring things; they never make it past the “this is obviously not important” 
filter. 

How does your brain know what’s important? Suppose you’re out for a day 
hike and a tiger jumps in front of you, what happens inside your head and 
body? 

Neurons fire. Emotions crank up. Chemicals surge. 

And that’s how your brain knows... 

This must be important! Don’t forget it! 

But imagine you’re at home, or in a library. It’s a safe, warm, tiger-free zone. 
You’re studying. Getting ready for an exam. Or trying to learn some tough 
technical topic your boss thinks will take a week, ten days at the most. 

Just one problem. Your brain’s trying to do you a big favor. It’s trying to 
make sure that this obviously non-important content doesn’t clutter up scarce 
resources. Resources that are better spent storing the really big things. 

Like tigers. Like the danger of fire. Like how you should never have 
posted those “party” photos on your Facebook page. And there’s no 
simple way to tell your brain, “Hey brain, thank you very much, but 
no matter how dull this book is, and how little I’m registering on the 
emotional Richter scale right now, I really do want you to keep this 
stuff around.” 
Metacognition: thinking about thinking 

If you really want to learn, and you want to learn more quickly and more 
deeply, pay attention to how you pay attention. Think about how you think. 
Learn how you learn. 

Most of us did not take courses on metacognition or learning theory when we 
were growing up. We were expected to learn, but rarely taught to learn. 


I wonder how 
I can trick my brain 
into remembering 
this stuff... 



The trick is to get your brain to see the new material you’re learning as 
Really Important. Crucial to your well-being. As important as a tiger. 
Otherwise, you’re in for a constant battle, with your brain doing its best to 
keep the new content from sticking. 


But we assume that if you’re holding this book, you really want to learn data 
analysis. And you probably don’t want to spend a lot of time. If you want to 
use what you read in this book, you need to remember what you read. And for 
that, you’ve got to understand it. To get the most from this book, or any book 
or learning experience, take responsibility for your brain. Your brain on this 
content. 


So just how DO you get your brain to treat data 
analysis like it was a hungry tiger? 


There’s the slow, tedious way, or the faster, more effective way. The 
slow way is about sheer repetition. You obviously know that you are able to learn 
and remember even the dullest of topics if you keep pounding the same thing into your 
brain. With enough repetition, your brain says, “This doesn’t feel important to him, but he 
keeps looking at the same thing over and over and over, so I suppose it must be.” 


The faster way is to do anything that increases brain activity, especially different 
types of brain activity. The things on the previous page are a big part of the solution, 
and they’re all things that have been proven to help your brain work in your favor. For 
example, studies show that putting words within the pictures they describe (as opposed to 
somewhere else on the page, like a caption or in the body text) causes your brain to try to 
make sense of how the words and picture relate, and this causes more neurons to fire. 
More neurons firing = more chances for your brain to get that this is something worth 
paying attention to, and possibly recording. 


A conversational style helps because people tend to pay more attention when they 
perceive that they’re in a conversation, since they’re expected to follow along and hold up 
their end. The amazing thing is, your brain doesn’t necessarily care that the “conversation” 
is between you and a book! On the other hand, if the writing style is formal and dry, your 
brain perceives it the same way you experience being lectured to while sitting in a roomful 
of passive attendees. No need to stay awake. 


But pictures and conversational style are just the beginning … 
Here's what WE did: 

We used pictures, because your brain is tuned for visuals, not text. As far as your brain’s 
concerned, a picture really is worth a thousand words. And when text and pictures work 
together, we embedded the text in the pictures because your brain works more effectively 
when the text is within the thing the text refers to, as opposed to in a caption or buried in the 
text somewhere. 

We used redundancy, saying the same thing in different ways and with different media types, 
and multiple senses, to increase the chance that the content gets coded into more than one area 
of your brain. 

We used concepts and pictures in unexpected ways because your brain is tuned for novelty, 
and we used pictures and ideas with at least some emotional content, because your brain 
is tuned to pay attention to the biochemistry of emotions. That which causes you to feel 
something is more likely to be remembered, even if that feeling is nothing more than a little 
humor, surprise, or interest. 


We used a personalized, conversational style, because your brain is tuned to pay more 
attention when it believes you’re in a conversation than if it thinks you’re passively listening 
to a presentation. Your brain does this even when you’re reading. 

We included more than 80 activities, because your brain is tuned to learn and remember 
more when you do things than when you read about things. And we made the exercises 
challenging-yet-do-able, because that’s what most people prefer. 

We used multiple learning styles, because you might prefer step-by-step procedures, while 
someone else wants to understand the big picture first, and someone else just wants to see 
an example. But regardless of your own learning preference, everyone benefits from seeing the 
same content represented in multiple ways. 


We include content for both sides of your brain, because the more of your brain you 
engage, the more likely you are to learn and remember, and the longer you can stay focused. 
Since working one side of the brain often means giving the other side a chance to rest, you 
can be more productive at learning for a longer period of time. 

And we included stories and exercises that present more than one point of view, 
because your brain is tuned to learn more deeply when it’s forced to make evaluations and 
judgments. 

We included challenges, with exercises, and by asking questions that don’t always have 
a straight answer, because your brain is tuned to learn and remember when it has to work at 
something. Think about it — you can’t get your body in shape just by watching people at the 
gym. But we did our best to make sure that when you’re working hard, it’s on the right things. 
That you’re not spending one extra dendrite processing a hard-to-understand example, 
or parsing difficult, jargon-laden, or overly terse text. 

We used people in stories, examples, pictures, etc., because, well, you’re a person. 
And your brain pays more attention to people than it does to things. 
Here's what YOU can do to bend 
your brain into submission 

So, we did our part. The rest is up to you. These tips are a 
starting point; listen to your brain and figure out what works 
for you and what doesn’t. Try new things. 

Cut this out and stick it on your refrigerator. 


1. Slow down. The more you understand, the 
less you have to memorize. 

Don’t just read. Stop and think. When the book asks 
you a question, don’t just skip to the answer. Imagine 
that someone really is asking the question. The 
more deeply you force your brain to think, the better 
chance you have of learning and remembering. 

2. Do the exercises. Write your own notes. 

We put them in, but if we did them for you, that 
would be like having someone else do your workouts 
for you. And don’t just look at the exercises. Use a 
pencil. There’s plenty of evidence that physical 
activity while learning can increase the learning. 

3. Read the “There are No Dumb Questions” sections. 

That means all of them. They’re not optional 
sidebars, they’re part of the core content! 
Don’t skip them. 

4. Make this the last thing you read before bed. 
Or at least the last challenging thing. 

Part of the learning (especially the transfer to 
long-term memory) happens after you put the book 
down. Your brain needs time on its own, to do more 
processing. If you put in something new during that 
processing time, some of what you just learned will 
be lost. 

5. Talk about it. Out loud. 

Speaking activates a different part of the brain. If 
you’re trying to understand something, or increase 
your chance of remembering it later, say it out loud. 
Better still, try to explain it out loud to someone else. 
You’ll learn more quickly, and you might uncover 
ideas you hadn’t known were there when you were 
reading about it. 

6. Drink water. Lots of it. 

Your brain works best in a nice bath of fluid. 
Dehydration (which can happen before you ever 
feel thirsty) decreases cognitive function. 

7. Listen to your brain. 

Pay attention to whether your brain is getting 
overloaded. If you find yourself starting to skim 
the surface or forget what you just read, it’s time 
for a break. Once you go past a certain point, you 
won’t learn faster by trying to shove more in, and 
you might even hurt the process. 

8. Feel something. 

Your brain needs to know that this matters. Get 
involved with the stories. Make up your own 
captions for the photos. Groaning over a bad joke 
is still better than feeling nothing at all. 

9. Get your hands dirty! 

There’s only one way to learn data analysis: get 
your hands dirty. And that’s what you’re going to do 
throughout this book. Data analysis is a skill, and 
the only way to get good at it is to practice. We’re 
going to give you a lot of practice: every chapter has 
exercises that pose a problem for you to solve. Don’t 
just skip over them; a lot of the learning happens 
when you solve the exercises. We included a solution 
to each exercise, so don’t be afraid to peek at the 
solution if you get stuck! (It’s easy to get snagged 
on something small.) But try to solve the problem 
before you look at the solution. And definitely get it 
working before you move on to the next part of the 
book. 


Read Me 


This is a learning experience, not a reference book. We deliberately stripped out everything 
that might get in the way of learning whatever it is we’re working on at that point in the 
book. And the first time through, you need to begin at the beginning, because the book 
makes assumptions about what you’ve already seen and learned. 


This book is not about software tools. 

Many books with “data analysis” in their titles simply go down the list of Excel functions 
considered to be related to data analysis and show you a few examples of each. Head First 
Data Analysis, on the other hand, is about how to be a data analyst. You’ll learn quite a 
bit about software tools in this book, but they are only a means to the end of learning how 
to do good data analysis. 


We expect you to know how to use basic spreadsheet formulas. 

Have you ever used the SUM formula in a spreadsheet? If not, you may want to bone up on 
spreadsheets a little before beginning this book. While many chapters do not ask you to use 
spreadsheets at all, the ones that do assume that you know how to use formulas. If you are 
familiar with the SUM formula, then you’re in good shape. 

This book is about more than statistics. 

There’s plenty of statistics in this book, and as a data analyst you should learn as much 
statistics as you can. Once you’re finished with Head First Data Analysis, it’d be a good idea 
to read Head First Statistics as well. But “data analysis” encompasses statistics and a number 
of other fields, and the many non-statistical topics chosen for this book are focused on the 
practical, nitty-gritty experience of doing data analysis in the real world. 

The activities are NOT optional. 

The exercises and activities are not add-ons; they’re part of the core content of the book. 
Some of them are to help with memory, some are for understanding, and some will help 
you apply what you’ve learned. Don’t skip the exercises. The crossword puzzles are the 
only thing you don’t have to do, but they’re good for giving your brain a chance to think 
about the words and terms you’ve been learning in a different context. 


The redundancy is intentional and important. 

One distinct difference in a Head First book is that we want you to really get it. And we 
want you to finish the book remembering what you’ve learned. Most reference books 
don’t have retention and recall as a goal, but this book is about learning, so you’ll see some 
of the same concepts come up more than once. 


The book doesn’t end here. 

We love it when you can find fun and useful extra stuff on book companion sites. You’ll 
find extra stuff on data analysis at the following url: 

http://www.headfirstlabs.com/books/hfda/ 

The Brain Power exercises don’t have answers. 


For some of them, there is no right answer, and for others, part of the learning 
experience of the Brain Power activities is for you to decide if and when your answers 
are right. In some of the Brain Power exercises, you will find hints to point you in the 
right direction. 


the review team 



The technical review team 




Bill Mietelski 


Technical Reviewers: 


Eric Heilman graduated Phi Beta Kappa from the Walsh School of Foreign Service at Georgetown University with 
a degree in International Economics. During his time as an undergraduate in DC, he worked at the State Department 
and at the National Economic Council at the White House. He completed his graduate work in economics at the 
University of Chicago. He currently teaches statistical analysis and math at Georgetown Preparatory School in 
Bethesda, MD. 

Bill Mietelski is a Software Engineer and a three-time Head First technical reviewer. He can’t wait to run a data 
analysis on his golf stats to help him win on the links. 


Anthony Rose has been working in the data analysis field for nearly ten years and is currently the president of 
Support Analytics, a data analysis and visualization consultancy. Anthony has an MBA with a concentration in Management 
and Finance, which is where his passion for data and analysis started. When he isn’t working, he can normally 
be found on the golf course in Columbia, Maryland, lost in a good book, savoring a delightful wine, or simply enjoying 
time with his young girls and amazing wife. 
Safari® Books Online 


When you see a Safari® icon on the cover of your favorite 
technology book that means the book is available online through 
the O’Reilly Network Safari Bookshelf. 


Safari offers a solution that’s better than e-books. It’s a virtual library that lets you easily 
search thousands of top tech books, cut and paste code samples, download chapters, 
and find quick answers when you need the most accurate, current information. Try it 
for free at http://my.safaribooksonline.com/?portal=oreilly. 
Introduction to data analysis 

Break it down 



Data is everywhere. 

Nowadays, everyone has to deal with mounds of data, whether they call themselves “data 
analysts” or not. But people who possess a toolbox of data analysis skills have a massive 
edge on everyone else, because they understand what to do with all that stuff. They know 
how to translate raw numbers into intelligence that drives real-world action. They know 
how to break down and structure complex problems and data sets to get right to the 
heart of the problems in their business. 










sales are off 


Acme Cosmetics needs your help 

It’s your first day on the job as a data analyst, 
and you were just sent this sales data from the 
CEO to review. The data describes sales of 
Acme’s flagship moisturizer, MoisturePlus. 


What has happened during the last six months with sales? 

How do their gross sales figures 
compare to their target sales figures? 



                       September    October      November     December     January      February
Gross sales            $5,280,000   $5,501,000   $5,469,000   $5,480,000   $5,533,000   $5,554,000
Target sales           $5,280,000   $5,500,000   $5,729,000   $5,968,000   $6,217,000   $6,476,000

Ad costs               $1,056,000   $950,400     $739,200     $528,000     $316,800     $316,800
Social network costs   $0           $105,600     $316,800     $528,000     $739,200     $739,200

Unit prices (per oz.)  $2.00        $2.00        $2.00        $1.90        $1.90        $1.90


Do you see a pattern 
in Acme’s expenses? 

What do you think is going 
on with these unit prices? 

Why are they going down? 


Take a look at the data. It’s fine not to 
know everything; just slow down and 
take a look. 

What do you see? How much does the 
table tell you about Acme’s business? 
About Acme’s MoisturePlus moisturizer? 


Good data analysts 
always want to see 
the data. 
The CEO wants data analysis 
to help increase sales 

He wants you to “give him an analysis.” 

It’s kind of a vague request, isn’t it? It sounds simple, 
but will your job be that straightforward? Sure, he 
wants more sales. Sure, he thinks something in the 
data will help accomplish that goal. But what, and 
how? 


Here’s the CEO. 

Now what could he mean by this? 




Think about what, fundamentally, the 
CEO is looking for from you with this 
question. When you analyze data, what 
are you doing? 







the steps of analysis 


Data analysis is careful 
thinking about evidence 


The expression “data analysis” covers a lot of 
different activities and a lot of different skills. If 
someone tells you that she’s a data analyst, you still 
won’t know much about what specifically she knows or 
does. 


You might bet that she knows Excel, 
but that’s about it! 



But all good analysts, regardless of their skills or goals, 
go through this same basic process during the 
course of their work, always using empirical evidence 
to think carefully about problems. 


Define: Knowing your problem is the very first step. 

Disassemble: Data analysis is all about breaking problems and data into smaller pieces. 

Evaluate: Here’s the meat of the analysis, where you draw your conclusions about what you’ve learned in the first two steps. 

Decide: Finally, you put it all back together and make (or recommend) a decision. 


In every chapter of this book, you’ll go 
through these steps over and over again, and 
they’ll become second nature really quickly. 

Ultimately, all data analysis is designed to lead 
to better decisions, and you’re about to 
learn how to make better decisions by gleaning 
insights from a sea of data. 
Define the problem 



Doing data analysis without explicitly 
defining your problem or goal is like heading 
out on a road trip without having decided on a 
destination. 

Sure, you might come across some interesting 
sights, and sometimes you might want to 
wander around in the hopes you’ll stumble 
on something cool, but who’s to say you’ll 
find anything? 


Ever seen an “analytical report” that’s a 

million pages long, with tons and tons 
of charts and diagrams? 

Every once in a while, an analyst really 
does need a ream of paper or an hour- 
long slide show to make a point. But 
in this sort of case, the analyst often 
hasn’t focused enough on his problem 
and is pelting you with information 
as a way of ducking his obligation to 
solve a problem and recommend a 
decision. 

Sometimes, the situation is even worse: 
the problem isn’t defined at all and the 
analyst doesn’t want you to realize that 
he’s just wandering around in the data. 


Here’s an analytical report. 

Road trip with a destination. Mission accomplished! 

Road trip without a destination. Who knows where you’ll end up? 

How do you define your problem? 
what are you looking for? 


Your client will help you 
define your problem 

He is the person your analysis is meant to 
serve. Your client might be your boss, your 
company’s CEO, or even yourself. 

Your client is the person who will make 
decisions on the basis of your analysis. You 
need to get as much information as you can 
from him to define your problem. 

The CEO here wants more sales. But that’s 
only the beginning of an answer. You need to 
understand more specifically what he means 
in order to craft an analysis that solves the 
problem. 




CEO of Acme Cosmetics 

This is your client, the person you’re working for. 

It’s a really good idea to know your client as well as you can. 

BULLET POINTS 

Your client might be: 

■ well or badly informed about his data 
■ well or badly informed about his business 
■ well or badly informed about his problems or goals 
■ focused or indecisive 
■ clear or vague 
■ intuitive or analytic 

Keep an eye at the bottom of the page during this chapter for these cues, which show you where you are. 

The better you understand your client, the more likely your analysis will be able to help. 


there are no Dumb Questions 


Q: I always like wandering around in data. Do you mean that 
I need to have some specific goal in mind before I even look at 
my data? 

A: You don’t need to have a problem in mind just to look at data. 
But keep in mind that looking by itself is not yet data analysis. Data 
analysis is all about identifying problems and then solving them. 

Q: I’ve heard about “exploratory data analysis,” where you 
explore the data for ideas you might want to evaluate further. 
There’s no problem definition in that sort of data analysis! 

A: Sure there is. Your problem in exploratory data analysis is to 
find hypotheses worth testing. That’s totally a concrete problem to 
solve. 

Q: Fine. Tell me more about these clients who aren’t well 
informed about their problems. Does that kind of person even 
need a data analyst? 

A: Of course! 

Q: Sounds to me like that kind of person needs professional 
help. 

A: Actually, good data analysts help their clients think through their 
problem; they don’t just wait around for their clients to tell them what 
to do. Your clients will really appreciate it if you can show them that 
they have problems they didn’t even know about. 

Q: That sounds silly. Who wants more problems? 

A: People who hire data analysts recognize that people with 
analytical skills have the ability to improve their businesses. Some 
people see problems as opportunities, and data analysts who show 
their clients how to exploit opportunity give them a competitive 
advantage. 


Sharpen your pencil 




The general problem is that we need to increase sales. What 
questions would you ask the CEO to understand better what he 
means specifically? List five. 


1. 

2. 

3. 

4. 

5. 













response from the client 


Acme's CEO has some feedback for you 


Your questions might be different. 




This email just came through 
in response to your questions. 

Lots of intelligence here … 

Here are some sample questions to the CEO to define 
your analytical goals. 

Always ask “how much.” Make your goals and 
beliefs quantitative. 

Anticipate what your client thinks about. He’s definitely 
going to be concerned with competitors. 

See something curious in the numbers? 
Ask about it! 


From: CEO, Acme Cosmetics 
To: Head First 

Subject: Re: Define the problem 

By how much do you want to increase sales? 

I need to get it back in line with our target sales, 
which you can see on the table. All our budgeting is 
built around those targets, and we’ll be in trouble if 
we miss them. 

How do you think we’ll do it? 

Well, that’s your job to figure out. But the strategy 
is going to involve getting people to buy more, and 
by “people” I mean tween girls (age 11-15). You’re 
going to get sales up with marketing of some sort or 
another. You’re the data person. Figure it out! 

How much of a sales increase do you think is feasible? 
Are the target sales figures reasonable? 

These tween girls have deep pockets. Babysitting 
money, parents, and so on. I don’t think there’s 
any limit to what we can make off of selling them 
MoisturePlus. 

How are our competitors’ sales? 

I don’t have any hard numbers, but my impression 
is that they are going to leave us in the dust. I’d say 
they’re 50-100 percent ahead of us in terms of gross 
moisturizer revenue. 

What’s the deal with the ads and the social networking 
marketing budget? 

We’re trying something new. The total budget is 
20 percent of our first month’s revenue. All of that 
used to go to ads, but we’re shifting it over to social 
networking. I shudder to think what’d be happening if 
we’d kept ads at the same level. 
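
The CEO's budget claim is easy to check against the table. Here is a minimal Python sketch of that check. The figures are copied from Acme's summary; the variable names are our own, and we're assuming "first month" means September. Treat it as one way to run the arithmetic, not as the only way.

```python
# Figures copied from Acme's summary table (US dollars).
months = ["September", "October", "November", "December", "January", "February"]
first_month_revenue = 5_280_000  # September gross sales (our reading of "first month")
ad_costs = [1_056_000, 950_400, 739_200, 528_000, 316_800, 316_800]
social_costs = [0, 105_600, 316_800, 528_000, 739_200, 739_200]

# The CEO says the total marketing budget is 20 percent of first-month revenue.
# Integer arithmetic keeps the comparison exact.
budget = first_month_revenue * 20 // 100

# Show how the same total shifts from ads to social networking over time.
for month, ads, social in zip(months, ad_costs, social_costs):
    total = ads + social
    print(f"{month:<10} total ${total:>9,}  social share {social / total:.0%}")

# True only if ads plus social networking equals the budget in every month.
budget_holds = all(a + s == budget for a, s in zip(ad_costs, social_costs))
print("Budget matches the CEO's description:", budget_holds)
```

If the check passes, the two cost rows are really one fixed budget being split differently each month, which is worth knowing before you compare either row with sales.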


Define → Disassemble → Evaluate → Decide 
Break the problem and 
data into smaller pieces 

The next step in data analysis is to take 
what you’ve learned about your problem 
from your client, along with your data, and 
break that information down into the level of 
granularity that will best serve your analysis. 




Divide the problem into smaller problems 


You need to divide your problem into manageable, 
solvable chunks. Often, your problem will be vague, 
like this: 


“How do we increase sales?” 

You might break that down into smaller questions: 

“What do our best customers want from us?” 
“What promotions are most likely to work?” 
“How is our advertising doing?” 


You can’t answer the big problem directly. But by 
answering the smaller problems, which you’ve analyzed 
out of the big problem, you can get your answer to the big 
one. 


Answer smaller problems 
to solve the bigger one. 


Divide the data into smaller chunks 

Same deal with the data. People aren’t going to 
present you the precise quantitative answers you 
need; you’ll need to extract important elements 
on your own. 

If the data you receive is a summary, like what 
you’ve received from Acme, you’ll want to know 
which elements are most important to you. 

If your data comes in a raw form, you’ll want to 
summarize the elements to make that data more 
useful. 



                       September    October      November     December     January      February
Gross sales            $5,280,000   $5,501,000   $5,469,000   $5,480,000   $5,533,000   $5,554,000
Target sales           $5,280,000   $5,500,000   $5,729,000   $5,968,000   $6,217,000   $6,476,000

Ad costs               $1,056,000   $950,400     $739,200     $528,000     $316,800     $316,800
Social network costs   $0           $105,600     $316,800     $528,000     $739,200     $739,200

Unit prices (per oz.)  $2.00        $2.00        $2.00        $1.90        $1.90        $1.90


December Target Sales $5,968,000 

November Unit Prices $2.00 

These might be the chunks you need to watch. 

More on these buzzwords in a moment. 


Let’s give disassembling 
a shot... 
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revisit the data 


Now take another look 
at what you know 

Let’s start with the data. Here you have a 
summary of Acme’s sales data, and the 
best way to start trying to isolate the most 
important elements of it is to find strong 

comparisons. 


Break down your summary 
data by searching for 
interesting comparisons. 


How do the gross and target sales figures 
compare to each other for October? 

How do January’s gross sales compare 
to February’s? 




                        September    October      November     December     January      February
Gross sales             $5,280,000   $5,501,000   $5,469,000   $5,480,000   $5,533,000   $5,554,000
Target sales            $5,280,000   $5,500,000   $5,729,000   $5,968,000   $6,217,000   $6,476,000
Ad costs                $1,056,000   $950,400     $739,200     $528,000     $316,800     $316,800
Social network costs    $0           $105,600     $316,800     $528,000     $739,200     $739,200
Unit prices (per oz.)   $2.00        $2.00        $2.00        $1.90        $1.90        $1.90



How are ad and social network costs 
changing relative to each other over time? 

Does the decrease in unit prices coincide 
with any change in gross sales? 


Making good comparisons is at the core of 
data analysis, and you’ll be doing it throughout 
this book. 

In this case, you want to build a conception 
in your mind of how Acme’s Moisture Plus 
business works by comparing their summary 
statistics. 
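The comparisons suggested above can be sketched in a few lines of Python. This is only an illustration, not a method the chapter prescribes: the variable names and the `compare_month` helper are made up for this example, and the "implied ounces" figure assumes that gross sales equal ounces sold times unit price, which the summary table does not confirm.

```python
# Monthly summary figures copied from Acme's table.
months = ["September", "October", "November", "December", "January", "February"]
gross_sales = [5_280_000, 5_501_000, 5_469_000, 5_480_000, 5_533_000, 5_554_000]
target_sales = [5_280_000, 5_500_000, 5_729_000, 5_968_000, 6_217_000, 6_476_000]
unit_price = [2.00, 2.00, 2.00, 1.90, 1.90, 1.90]  # dollars per oz.


def compare_month(gross, target, price):
    """Return the comparisons for one month as a dict (hypothetical helper)."""
    shortfall = target - gross  # positive when sales miss the target
    return {
        "shortfall": shortfall,
        "shortfall_pct": 100 * shortfall / target,
        # Assumption: gross sales = ounces sold x unit price.
        "implied_oz": gross / price,
    }


comparisons = [
    compare_month(g, t, p)
    for g, t, p in zip(gross_sales, target_sales, unit_price)
]

for month, row in zip(months, comparisons):
    print(
        f"{month:<10} shortfall ${row['shortfall']:>9,} "
        f"({row['shortfall_pct']:.1f}% of target), "
        f"about {row['implied_oz']:,.0f} oz. sold"
    )
```

Reading the output month by month shows the gap between gross and target sales opening up from November onward and reaching $922,000 in February.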





You’ve defined the problem: figure out how 
to increase sales. But that problem tells 
you very little about how you’re expected to do 
it, so you elicited a lot of useful commentary 
from the CEO. 


This commentary provides an important 
baseline set of assumptions about how 
the cosmetics business works. Hopefully, the 
CEO is right about those assumptions, because 
they will be the backbone of your analysis! 
What are the most important points that the 
CEO makes? 


This commentary is itself a 
kind of data. Which parts 
of it are most important? 


Here’s the “how” question. 

Which is most useful? 


From: CEO, Acme Cosmetics 
To: Head First 

Subject: Re: Define the problem 

By how much do you want to increase sales? 

I need to get it back in line with our target sales, 
which you can see on the table. All our budgeting is 
built around those targets, and we’ll be in trouble if 
we miss them. 

How do you think we’ll do it? 

Well, that's your job to figure out. But the strategy 
is going to involve getting people to buy more, and 
by “people” I mean tween girls (age 11-15). You’re 
going to get sales up with marketing of some sort or 
another. You're the data person. Figure it out! 

How much of a sales increase do you think is feasible? 
Are the target sales figures reasonable? 

These tween girls have deep pockets. Babysitting 
money, parents, and so on. I don’t think there's 
any limit to what we can make off of selling them 
MoisturePlus. 

How are our competitors' sales? 

I don't have any hard numbers, but my impression 
is that they are going to leave us in the dust. I’d say 
they're 50-100 percent ahead of us in terms of gross 
moisturizer revenue. 

What's the deal with the ads and the social networking 
marketing budget? 

We're trying something new. The total budget is 
20 percent of our first month's revenue. All of that 
used to go to ads, but we're shifting it over to social 
networking. I shudder to think what’d be happening 
if we'd kept ads at the same level. 
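The CEO’s description of the marketing budget can be checked against the summary table with a little arithmetic. The sketch below is an illustration only; the variable names are invented for this example, and the figures are copied from the table.

```python
# Figures copied from Acme's summary table (September through February).
first_month_revenue = 5_280_000
ad_costs = [1_056_000, 950_400, 739_200, 528_000, 316_800, 316_800]
social_costs = [0, 105_600, 316_800, 528_000, 739_200, 739_200]

# "The total budget is 20 percent of our first month's revenue."
budget = first_month_revenue * 20 // 100

# Does ad spending plus social network spending equal the budget every month?
monthly_totals = [ads + social for ads, social in zip(ad_costs, social_costs)]
budget_is_constant = all(total == budget for total in monthly_totals)

# Share of the budget that has shifted to social networking each month.
social_share = [social / budget for social in social_costs]

print(f"Budget: ${budget:,}")
print(f"Budget constant across months: {budget_is_constant}")
print("Social share by month:", [f"{share:.0%}" for share in social_share])
```

The totals match the CEO’s statement: the budget stays fixed at $1,056,000 while the split moves from ads toward social networking.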


Sharpen your pencil 


Summarize what your client believes and your thoughts on the 
data you’ve received to do the analysis. Analyze the above email 
and your data into smaller pieces that describe your situation. 


Your client’s beliefs. 


Your thoughts on the data. 


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analyze the client 


Sharpen your pencil 


Solution 


You just took an inventory of your and your client’s beliefs. What 
did you find? 


Your own answers might be slightly different. 


Your client’s beliefs. 


1. MoisturePlus customers are tween girls (where tweens are people aged 11-15). They’re basically the 
   only customer group. 

2. Acme is trying out reallocating expenses from advertisements to social networking, but so far, the 
   success of the initiative is unknown. 
   (Good... this is the sort of thing one does nowadays.) 

3. We see no limit to potential sales growth among tween girls. 

4. Acme’s competitors are extremely dangerous. 
   (This could be worth remembering.) 


Your thoughts on the data. 


1. Sales are slightly up in February compared to September, but kind of flat. 

2. Sales are way off their targets and began diverging in November. 
   (Big problem.) 

3. Cutting ad expenses may have hurt Acme’s ability to keep pace with sales targets. 
   (What should they do next?) 

4. Cutting the prices does not seem to have helped sales keep pace with targets. 


You’ve successfully broken your problem 
into smaller, more manageable pieces. 

Now it’s time to evaluate 
those pieces in greater detail... 


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introduction to data analysis 


Evaluate the pieces 


Here comes the fun part. You know what you 
need to figure out, and you know what chunks 
of data will enable you to do it. Now, take a close, 
focused look at the pieces and form your own 
judgements about them. 




Just as it was with disassembly, the key to 
evaluating the pieces you have isolated is 

comparison. 

What do you see when you compare these 
elements to each other? 


Pick any two elements and read them next to each other. 
What do you see? 

Observations about the problem 

MoisturePlus customers are tween girls (where 
tweens are people aged 11-15). They’re basically 
the only customer group. 

Acme is trying out reallocating expenses from 
advertisements to social networking, but so far, 
the success of the initiative is unknown. 

We see no limit to potential sales 
growth among tween girls. 

Acme’s competitors are extremely dangerous. 

Observations about the data 

Sales are slightly up in February 
compared to September, but kind of flat. 

Sales are way off their targets. 

Cutting ad expenses may have hurt Acme’s 
ability to keep pace with sales targets. 

Cutting the prices does not seem to have 
helped sales keep pace with targets. 
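One rough way to look at the observation that cutting ad expenses may have hurt sales is to compare ad costs with the shortfall against target, month by month. The sketch below computes a Pearson correlation by hand. It is an illustration under stated assumptions, not an analysis the chapter performs: the `pearson` helper and variable names are invented here, and six data points that both trend with time cannot show that one caused the other.

```python
from math import sqrt

# Figures copied from Acme's summary table (September through February).
gross_sales = [5_280_000, 5_501_000, 5_469_000, 5_480_000, 5_533_000, 5_554_000]
target_sales = [5_280_000, 5_500_000, 5_729_000, 5_968_000, 6_217_000, 6_476_000]
ad_costs = [1_056_000, 950_400, 739_200, 528_000, 316_800, 316_800]

# Shortfall is positive when gross sales miss the target.
shortfall = [target - gross for target, gross in zip(target_sales, gross_sales)]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


r = pearson(ad_costs, shortfall)
print(f"Correlation between ad costs and shortfall: {r:.2f}")
```

A strongly negative value is consistent with the observation, but it is just as consistent with other stories, such as targets that rise every month regardless of what marketing does. That is why the observation is worded as "may have hurt."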



You have almost all the 
right pieces, but one 
important piece is missing... 
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you are your analysis 


Analysis begins when you insert yourself 

Inserting yourself into your analysis means making 
your own assumptions explicit and betting your 
credibility on your conclusions. 

Whether you’re building complex models or making 
simple decisions, data analysis is all about you: your 
beliefs, your judgement, your credibility. 


Insert yourself 


Good for your clients 

Your client will respect your judgments 
more. 

Your client will understand the 
limitations of your conclusions. 


Good for you 

You'll know what to look for in the data. 

You'll avoid overreaching in your 
conclusions. 

You'll be responsible for the success of 
your work. 


Your prospects for success are much better 
if you are a part of your analysis. 


Don’t insert yourself 

Bad for your client 

Your client won’t trust your analysis, 
because he won’t know your motives 
and incentives. 

Your client might get a false sense of 
“objectivity” or detached rationality. 


Bad for you 

You’ll lose track of how your baseline 
assumptions affect your conclusions. 

You’ll be a wimp who avoids 
responsibility! 


Yikes! You don’t want to 
run into these problems. 


As you craft your final report, be sure to refer to yourself, 
so that your client knows where your conclusions are 
coming from. 
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introduction to data analysis 


Make a recommendation 

As a data analyst, your job is to empower yourself 
and your client to make better decisions, using 
insights gleaned from carefully studying your 
evaluation of the data. 



Making that happen means you have to package 
your ideas and judgments together into a format 
that can be digested by your client. 


That means making your work as simple as it can 
be, but not simpler! It’s your job to make sure 
your voice is heard and that people make 
good decisions on the basis of what you have to 
say. 


An analysis is useless unless 
it’s assembled into a form 
that facilitates decisions. 


The report you present to your client 
needs to be focused on making yourself 
understood and encouraging intelligent, 
data-based decision making. 



Sharpen your pencil 


Look at the information you’ve collected on the previous pages. 
What do you recommend that Acme does to increase sales? Why? 
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present your findings 


Your report is ready 


This is the stuff you 
got from the CEO 
at the beginning. 


Here’s the meat 
of your analysis. 


Your conclusion 
might be different. 


Acme Cosmetics 
Analytical Report 


Context 


It’s a good idea to state 
your and your client’s 
assumptions in your report. 


MoisturePlus customers are tween girls (where tweens are 
people aged 11-15). They’re basically the only customer group. 
Acme is trying out reallocating expenses from advertisements 
to social networking, but so far, the success of the initiative is 
unknown. We see no limit to potential sales growth among 
tween girls. Acme’s competitors are extremely dangerous. 

Interpretation of data 

Sales are slightly up in 
February compared to 
September, but kind of 
flat. Sales are way off their 
targets. Cutting ad expenses 
may have hurt Acme’s 
ability to keep pace with 
sales targets. Cutting the 
prices does not seem to 
have helped sales keep pace 
with targets. 

Recommendation 

It might be that the decline in sales relative to the target is 
linked to the decline in advertising relative to past advertising 
expenses. We have no good evidence to believe that social 
networking has been as successful as we had hoped. I will 
return advertising to September levels to see if the tween girls 
respond. Advertising to tween girls is the way to get 
gross sales back in line with targets. 



A simple graphic 
to illustrate your 
conclusion. 


What will the CEO think? 
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introduction to data analysis 


The CEO likes your work 


Excellent work. I’m totally persuaded. 
I’ll execute the order for more ads at 
once. I can’t wait to see what happens! 


Your report is concise, 
professional, and direct. 

It speaks to the CEO’s needs in a 
way that’s even clearer than his own 
way of describing them. 

You looked at the data, got greater 
clarity from the CEO, compared his 
beliefs to your own interpretation 
of his data, and recommended a 
decision. 

Nice work! 



How will your recommendation 
affect Acme’s business? 


Will Acme’s sales 
increase? 


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surprise information 


An article just came across the wire 


Seems like a nice article, 
on the face of it. 


Dataville Business Daily 


MoisturePlus 
achieves complete 
market saturation 
among tween girls 

Our very own cosmetics 
industry analysts report 
that the tween girl 
moisturizer market is 
completely dominated by 
Acme Cosmetics’s flagship 
product, MoisturePlus. 
According to the DBD’s 
survey, 95 percent of tween 
girls report “Very Frequent” 
usage of MoisturePlus, 
typically twice a day or 
more. 

The Acme CEO was 
surprised when our 
reporter told him of 
our findings. “We are 
committed to providing 
our tween customers the 
most luxurious cosmetic 
experience possible at 
just-accessible prices,” he said. 
“I’m delighted to hear that 
MoisturePlus has achieved 
so much success with them. 
Hopefully, our analytical 
department will be able 
to deliver this information 
to me in the future, rather 
than the press.” 

Acme’s only viable 
competitor in this market 
space, Competition 
Cosmetics, responded to 
our reporter’s inquiry saying, 
“We have basically given up 
on marketing to tween girls. 
The customers that we 
recruit for viral marketing 
are made fun of by their 
friends for allegedly using 
a cheap, inferior product. 
The MoisturePlus brand is 
so powerful that it’s a waste 
of our marketing dollars 
to compete. With any luck, 
the MoisturePlus brand 
will take a hit if something 
happens like their celebrity 
endorsement getting caught 
on video having... 


What does this mean for your analysis? 


On the face of it, this sounds good for Acme. 
But if the market’s saturated, more ads to 
tween girls probably won’t do much good. 
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introduction to data analysis 


You're lucky I got this 
call. I canceled the tween 
girl ad campaign. Now come 
back to me with a plan that 
works. 


It’s hard to imagine the tween girl 
campaign would have worked. If the 

overwhelming majority of them are using 
MoisturePlus two or more times a day, what 
opportunity is there for increasing sales? 

You’ll need to find other opportunities for sales 
growth. But first, you need to get a handle on 
what just happened to your analysis. 






Somewhere along the way, you picked up some 
bad or incomplete information that left you 
blind to these facts about tween girls. What was 
that information? 
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analysis with bad models 


You let the CEO’s beliefs take 
you down the wrong path 

Here’s what the CEO said about how 
MoisturePlus sales work: 


The CEO’s beliefs about MoisturePlus 

MoisturePlus customers are tween girls (where tweens are people aged 
11-15). They’re basically the only customer group. 

Acme is trying out reallocating expenses from advertisements to social 
networking, but so far, the success of the initiative is unknown. 

We see no limit to potential sales growth among tween girls. 

Acme’s competitors are extremely dangerous. 



This is a 
mental model... 


Take a look at how these beliefs fit with the 
data. Do the two agree or conflict? Do they 
describe different things? 



                        September    October      November     December     January      February
Gross sales             $5,280,000   $5,501,000   $5,469,000   $5,480,000   $5,533,000   $5,554,000
Target sales            $5,280,000   $5,500,000   $5,729,000   $5,968,000   $6,217,000   $6,476,000
Ad costs                $1,056,000   $950,400     $739,200     $528,000     $316,800     $316,800
Social network costs    $0           $105,600     $316,800     $528,000     $739,200     $739,200
Unit prices (per oz.)   $2.00        $2.00        $2.00        $1.90        $1.90        $1.90


The data doesn’t say anything about tween 
girls. He assumes that tween girls are the only buyers 
and that tween girls have the ability to purchase more. 

In light of the article, you might 
want to reassess these beliefs. 
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introduction to data analysis 


Your assumptions and beliefs about 
the world are your mental model 


And in this case, it’s problematic. If the 
newspaper report is true, the CEO’s beliefs 
about tween girls are wrong. Those beliefs are 
the model you’ve been using to interpret the 
data. 



The world is complicated, so we use mental 
models to make sense of it. Your brain is like 
a toolbox, and any time your brain gets new 
information, it picks a tool to help interpret 
that information. 


Mental models can be hard-wired, innate 
cognitive abilities, or they can be theories that 
you learn. Either way, they have a big impact 
on how you interpret data. 



Sometimes mental models are a big help, 
and sometimes they cause problems. In this 
book, you’ll get a crash course on how to use 
them to your advantage. 

What’s most important for now is that 
you always make them explicit and give 
them the same serious and careful 
treatment that you give data. 


Always make your 
mental models as 
explicit as possible. 


...a tool to make sense 
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interaction of models 


Your statistical model depends 
on your mental model 


Mental models determine what you see. 
They’re your lens for viewing reality. 



Your mental model is 
like the lens you use 
to view the world. 


You can’t see everything, so your brain has to 
be selective in what it chooses to focus your 
attention on. So your mental model largely 

determines what you see. 



One mental model will draw your attention 
to some features of the world... 

...and a different mental model will draw 
your attention to other features. 



If you’re aware of your mental model, you’re 
more likely to see what’s important and 
develop the most relevant and useful statistical 
models. 


Your statistical model depends on your 
mental model. If you use the wrong mental 
model, your analysis fails before it even begins. 



You’d better get the 
mental model right! 
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introduction to data analysis 


Sharpen your pencil 


Let’s take another look at the data and think about what other 
mental models would fit the data. 



                        September    October      November     December     January      February
Gross sales             $5,280,000   $5,501,000   $5,469,000   $5,480,000   $5,533,000   $5,554,000
Target sales            $5,280,000   $5,500,000   $5,729,000   $5,968,000   $6,217,000   $6,476,000
Ad costs                $1,056,000   $950,400     $739,200     $528,000     $316,800     $316,800
Social network costs    $0           $105,600     $316,800     $528,000     $739,200     $739,200
Unit prices (per oz.)   $2.00        $2.00        $2.00        $1.90        $1.90        $1.90



List some assumptions that would be true if MoisturePlus is 
actually the preferred lotion for tweens. 


Use your creativity! 




List some assumptions that would be true if MoisturePlus was 
in serious danger of losing customers to their competition. 
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meet model fitting 



You just looked at your summary data with a new perspective: 
how would different mental models fit? 



                        September    October      November     December     January      February
Gross sales             $5,280,000   $5,501,000   $5,469,000   $5,480,000   $5,533,000   $5,554,000
Target sales            $5,280,000   $5,500,000   $5,729,000   $5,968,000   $6,217,000   $6,476,000
Ad costs                $1,056,000   $950,400     $739,200     $528,000     $316,800     $316,800
Social network costs    $0           $105,600     $316,800     $528,000     $739,200     $739,200
Unit prices (per oz.)   $2.00        $2.00        $2.00        $1.90        $1.90        $1.90


1. List some assumptions that would be true if MoisturePlus is 
actually the preferred lotion for tweens. 

Tween girls spend almost all their moisturizer dollars on MoisturePlus. 

Acme needs to find new markets for MoisturePlus to increase sales. 

There are no meaningful competitors to MoisturePlus. It’s by far the best product. 

Social networks are the most cost-effective way to sell to people nowadays. 

Price increases on MoisturePlus would reduce market share. 

2. List some assumptions that would be true if MoisturePlus was 
in serious danger of losing customers to their competition. 

(This is a challenging world.) 

Tween girls are shifting to a new moisturizer product, and Acme needs to fight back. 

MoisturePlus is considered “uncool” and “just for dorks.” 

The “dry” skin look is becoming popular among young people. 

Social networking is a black hole, and we need to go back to ads. 

Tween girls are willing to spend much more money on moisturizer. 

It’s not unusual for your client to have the completely wrong 
mental model. In fact, it’s really common for people to ignore 
what might be the most important part of the model... 
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introduction to data analysis 


Mental models should always 
include what you don't know 


Always specify uncertainty. If you’re explicit 
about uncertainty, you’ll be on the lookout for 
ways to use data to fill gaps in your knowledge, 
and you will make better recommendations. 

Thinking about uncertainties and blind spots 
can be uncomfortable, but the payoff is huge. 

This “anti-resume” talks about what someone 
doesn’t know rather than what they do know. 

If you want to hire a dancer, say, the dances 
they don’t know might be more interesting to 
you than the dances they do know. 

When you hire people, you often 
find out what they don’t 
know only when it’s too late. 

It’s the same deal with data analysis. Being clear 
about your knowledge gaps is essential. 

Specify uncertainty up front, and you 
won’t get nasty surprises later on. 




Sharpen your pencil 



Head First Anti-Resume 

Experiences I haven’t had: 

Being arrested 
Eating crawfish 
Riding a unicycle 
Shoveling snow 

Things I don’t know: 

The first fifty digits of Pi 

How many mobile minutes I used today 

The meaning of life 

Things I don’t know how to do: 

Make a toast in Urdu 
Dance merengue 
Shred on the guitar 

Books I haven’t read: 


What questions would you ask the CEO to find out what he 
doesn't know? 


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the client confides 


The CEO tells you what 
he doesn't know 


From: CEO, Acme Cosmetics 
To: Head First 

Subject: Re: Managing uncertainty 

Where would you say are the biggest gaps in your 
knowledge about MoisturePlus sales? 

Well, that’s an interesting question. I’d always 
thought we really understood how customers felt 
about our product. But since we don’t sell direct to 
consumers, we really don’t know what happens after 
we send our product to our resellers. So, yeah, we 
don’t really know what happens once MoisturePlus 
leaves the warehouse. 

How confident are you that advertising has increased 
sales in the past? 

Well, like they always say, half of it works, half of it 
doesn’t, and you never know which half is which. 
But it’s pretty clear that the MoisturePlus brand is 
most of what our customers are buying, because 
MoisturePlus isn’t terribly different from other 
moisturizers, so ads are key to establishing the 
brand. 

Who else might buy the product besides tween girls? 

I just have no idea. No clue. Because the product 
is so brand-driven we only think about tween girls. 
We’ve never reached out to any other consumer 
group. 

Are there any other lingering uncertainties that I should 
know about? 

Sure, lots. You’ve scared the heck out of me. I don’t 
feel like I know anything about my product any more. 
Your data analysis makes me think I know less than I 
ever knew. 


It’s fine to get the 
client to speculate. 


Not a lot of certainty 
here on how well 
advertising works. 


This is a big blind spot! 


Who else might be 
buying MoisturePlus? 

Are there other buyers besides 
tween girls? 


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introduction to data analysis 


There are no Dumb Questions 

Q: That’s a funny thing the CEO said 
at the end: data analysis makes you feel 
like you know less. He’s wrong about that, 
right? 

A: It depends on how you look at it. 
Nowadays, more and more problems can 
be solved by using the techniques of data 
analysis. These are problems that, in the 
past, people would solve using gut instincts, 
flying by the seat of their pants. 

Q: So mental models feel more and 
more flimsy compared to how they felt in 
the past? 

A: A lot of what mental models do is help 
you fill in the gaps of what you don’t know. 
The good news is that the tools of data 
analysis empower you to fill those gaps in 
a systematic and confidence-inspiring way. 
So the point of the exercise of specifying 
your uncertainty in great detail is to help you 
see the blind spots that require hard-nosed 
empirical data work. 

Q: But won’t I always need to use 
mental models to fill in the gaps of 
knowledge in how I understand the 
world? 

A: Absolutely... 

Q: Because even if I get a good 
understanding of how things work right 
now, ten minutes from now the world will 
be different. 

A: That’s exactly right. You can’t know 
everything, and the world’s constantly 
changing. That’s why specifying your 
problem rigorously and managing the 
uncertainties in your mental model is so 
important. You have only so much time 
and resources to devote to solving your 
analytical problems, so answering these 
questions will help you do it efficiently and 
effectively. 

Q: Does stuff you learn from your 
statistical models make it into your 
mental models? 

A: Definitely. The facts and phenomena 
you discover in today’s research often 
become the assumptions that take you into 
tomorrow’s research. Think of it this way: 
you'll inevitably draw wrong conclusions from 
your statistical models. Nobody’s perfect. 
And when those conclusions become part of 
your mental model, you want to keep them 
explicit, so you can recognize a situation 
where you need to double back and change 
them. 

Q: So mental models are things that 
you can test empirically? 

A: Yes, and you should test them. You 
can’t test everything, but everything in your 
model should be testable. 

Q: How do you change your mental 
model? 

A: You’re about to find out... 


The CEO ordered more data 
to help you look for market 
segments besides tween girls. 
Let’s take a look. 
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copious data incoming 


Acme just sent you a huge list of raw data 


When you get new data, and you haven’t done 
anything to change it yet, it’s considered raw 
data. You will almost always need to 
manipulate data you get from someone 
else in order to get it into a useful form for the 
number crunching you want to do. 

Just be sure to save your originals. 

And keep them separate from any data 
manipulation you do. Even the best analysts 
make mistakes, and you always need to be able 
to compare your work to the raw data. 
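One way to follow this advice when the data arrives as a file is to copy it before touching it and make the original read-only. The sketch below is a minimal illustration, not a procedure from the book: the `make_working_copy` helper, the folder names, and the sample file contents are all hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def make_working_copy(raw_path: Path, work_dir: Path) -> Path:
    """Copy the raw file into work_dir and mark the original read-only."""
    work_dir.mkdir(parents=True, exist_ok=True)
    working_path = work_dir / raw_path.name
    shutil.copy2(raw_path, working_path)  # copy first, so the copy stays writable
    raw_path.chmod(0o444)                 # then protect the original
    return working_path


# Demonstration in a throwaway directory (hypothetical file and contents).
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "raw" / "shipments.csv"
    raw.parent.mkdir()
    raw.write_text("date,vendor,lot_size\n9/1/08,Sassy Girl Cosmetics,5253\n")

    working = make_working_copy(raw, Path(tmp) / "working")

    # Manipulate only the working copy.
    working.write_text(working.read_text().upper())

    original_untouched = "Sassy Girl Cosmetics" in raw.read_text()
    working_changed = "SASSY GIRL COSMETICS" in working.read_text()
    working_name = working.name

    # Restore write permission so the temporary directory can be
    # cleaned up on every platform.
    raw.chmod(0o644)

print(original_untouched, working_changed)
```

Keeping the raw file in its own folder means any result can be traced back and compared against the data exactly as it was received.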


This is a lot of stuff... 
maybe more than you need. 


That’s sooo much 
data! What do I do? 
Where do I begin? 


Date      Vendor                         Lot size (units)  Shipping ZIP  Cost
9/1/08    Sassy Girl Cosmetics           5253              20817         $75,643
9/3/08    Sassy Girl Cosmetics           6148              20817         $88,531
9/4/08    Prissy Princess                8931              20012         $128,606
9/14/08   Sassy Girl Cosmetics           2031              20817         $29,246
9/14/08   Prissy Princess                8029              20012         $115,618
9/15/08   General American Wholesalers   3754              20012         $54,058
9/20/08   Sassy Girl Cosmetics           7039              20817         $101,362
9/21/08   Prissy Princess                7478              20012         $107,683
9/25/08   General American Wholesalers   2646              20012         $38,102
9/26/08   Sassy Girl Cosmetics           6361              20817         $91,598
10/4/08   Prissy Princess                9481              20012         $136,526
10/7/08   General American Wholesalers   8598              20012         $123,811
10/9/08   Sassy Girl Cosmetics           6333              20817         $91,195
10/12/08  General American Wholesalers   4813              20012         $69,307
10/15/08  Prissy Princess                1550              20012         $22,320
10/20/08  Sassy Girl Cosmetics           3230              20817         $46,512
10/25/08  Sassy Girl Cosmetics           2064              20817         $29,722
10/27/08  General American Wholesalers   8298              20012         $119,491
10/28/08  Prissy Princess                8300              20012         $119,520
11/3/08   General American Wholesalers   6791              20012         $97,790
11/4/08   Prissy Princess                3775              20012         $54,360
11/10/08  Sassy Girl Cosmetics           8320              20817         $119,808
11/10/08  Sassy Girl Cosmetics           6160              20817         $88,704
11/10/08  General American Wholesalers   1894              20012         $27,274
11/15/08  Prissy Princess                1697              20012         $24,437
11/24/08  Prissy Princess                4825              20012         $69,480
11/28/08  Sassy Girl Cosmetics           6188              20817         $89,107
11/28/08  General American Wholesalers   4157              20012         $59,861
12/3/08   Sassy Girl Cosmetics           6841              20817         $98,510
12/4/08   Prissy Princess                7483              20012         $107,755
12/6/08   General American Wholesalers   1462              20012         $21,053
12/11/08  General American Wholesalers   8680              20012         $124,992
12/14/08  Sassy Girl Cosmetics           3221              20817         $46,382
12/14/08  Prissy Princess                6257              20012         $90,101
12/24/08  General American Wholesalers   4504              20012         $64,858
12/25/08  Prissy Princess                6157              20012         $88,661
12/28/08  Sassy Girl Cosmetics           5943              20817         $85,579
1/7/09    Sassy Girl Cosmetics           4415              20817         $63,576
1/10/09   Prissy Princess                2726              20012         $39,254
1/10/09   General American Wholesalers   4937              20012         $71,093
1/15/09   Sassy Girl Cosmetics           9602              20817         $138,269
1/18/09   General American Wholesalers   7025              20012         $101,160
1/20/09   Prissy Princess                4726              20012         $68,054


A lot of data is usually a good thing. 

Just stay focused on what you’re trying to 
accomplish with the data. If you lose track of 
your goals and assumptions, it’s easy to get “lost” 
messing around with a large data set. But good data analysis is all 
about keeping focused on what you want to learn about the data. 







Exercise 


Take a close look at this data and think about the CEO’s mental model. 
Does this data fit with the idea that the customers are all tween girls, or 
might it suggest other customers? 



Your answer here. 


Date      Vendor                         Lot size (units)  Shipping ZIP  Cost
9/1/08    Sassy Girl Cosmetics           5253              20817         $75,643
9/3/08    Sassy Girl Cosmetics           6148              20817         $88,531
9/4/08    Prissy Princess                8931              20012         $128,606
9/14/08   Sassy Girl Cosmetics           2031              20817         $29,246
9/14/08   Prissy Princess                8029              20012         $115,618
9/15/08   General American Wholesalers   3754              20012         $54,058
9/20/08   Sassy Girl Cosmetics           7039              20817         $101,362
9/21/08   Prissy Princess                7478              20012         $107,683
9/25/08   General American Wholesalers   2646              20012         $38,102
9/26/08   Sassy Girl Cosmetics           6361              20817         $91,598
10/4/08   Prissy Princess                9481              20012         $136,526
10/7/08   General American Wholesalers   8598              20012         $123,811
10/9/08   Sassy Girl Cosmetics           6333              20817         $91,195
10/12/08  General American Wholesalers   4813              20012         $69,307
10/15/08  Prissy Princess                1550              20012         $22,320
10/20/08  Sassy Girl Cosmetics           3230              20817         $46,512
10/25/08  Sassy Girl Cosmetics           2064              20817         $29,722
10/27/08  General American Wholesalers   8298              20012         $119,491
10/28/08  Prissy Princess                8300              20012         $119,520
11/3/08   General American Wholesalers   6791              20012         $97,790
11/4/08   Prissy Princess                3775              20012         $54,360
11/10/08  Sassy Girl Cosmetics           8320              20817         $119,808
11/10/08  Sassy Girl Cosmetics           6160              20817         $88,704
11/10/08  General American Wholesalers   1894              20012         $27,274
11/15/08  Prissy Princess                1697              20012         $24,437
11/24/08  Prissy Princess                4825              20012         $69,480
11/28/08  Sassy Girl Cosmetics           6188              20817         $89,107
11/28/08  General American Wholesalers   4157              20012         $59,861
12/3/08   Sassy Girl Cosmetics           6841              20817         $98,510
12/4/08   Prissy Princess                7483              20012         $107,755
12/6/08   General American Wholesalers   1462              20012         $21,053
12/11/08  General American Wholesalers   8680              20012         $124,992
12/14/08  Sassy Girl Cosmetics           3221              20817         $46,382
12/14/08  Prissy Princess                6257              20012         $90,101
12/24/08  General American Wholesalers   4504              20012         $64,858
12/25/08  Prissy Princess                6157              20012         $88,661
12/28/08  Sassy Girl Cosmetics           5943              20817         $85,579
1/7/09    Sassy Girl Cosmetics           4415              20817         $63,576
1/10/09   Prissy Princess                2726              20012         $39,254
1/10/09   General American Wholesalers   4937              20012         $71,093
1/15/09   Sassy Girl Cosmetics           9602              20817         $138,269
1/18/09   General American Wholesalers   7025              20012         $101,160
1/20/09   Prissy Princess                4726              20012         $68,054
scrutinize the data



What did you see in the data? Is the CEO right that only tween girls purchase MoisturePlus, or 
might there be someone else? 


These companies sound like
they sell to tween girls.

We can certainly see that Acme is selling to companies
that go on to sell to younger girls. Sassy Girl Cosmetics
and Prissy Princess definitely seem to fit the bill.

But there's another reseller on the list: "General
American Wholesalers." The name alone doesn't say who
its clients are, but it might be worth researching.

Who are these people?
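One way to make this kind of scan less dependent on eyeballing is to total the units per vendor. The sketch below is only an illustration: it uses a handful of rows copied from the purchase table, and the function name `units_by_vendor` is made up for this example.

```python
from collections import defaultdict

# A small subset of (vendor, lot size in units) rows copied from the
# purchase table; the full table has many more rows.
orders = [
    ("Sassy Girl Cosmetics", 5253),
    ("Sassy Girl Cosmetics", 6148),
    ("Prissy Princess", 8931),
    ("General American Wholesalers", 3754),
    ("General American Wholesalers", 2646),
]


def units_by_vendor(rows):
    """Sum the lot sizes for each vendor (hypothetical helper)."""
    totals = defaultdict(int)
    for vendor, units in rows:
        totals[vendor] += units
    return dict(totals)


totals = units_by_vendor(orders)

# List vendors from largest to smallest so unexpected names stand out.
for vendor, units in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{vendor:30s} {units:6d}")
```

A vendor whose name says nothing about its customers, like General American Wholesalers, is exactly the kind of row worth following up on.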





Time to drill further into the data


You looked at the mass of data with a 
very clear task: find out who’s buying 
besides tween girls. 

You found a company called General 
American Wholesalers. Who are they? 
And who’s buying from them? 


[Diagram: Acme sells to Sassy Girl Cosmetics, which sells to tween girls. Acme also sells to General American Wholesalers, which sells to an unknown group of customers. Annotation: "It's a good idea to label your arrows."]


Exercise


At Acme’s request, General American Wholesalers sent over this breakdown of their customers 
for MoisturePlus. Does this information help you figure out who’s buying? 

Write down what this data tells you
about who's buying MoisturePlus.


GAW vendor breakdown for six months ending 2/2009 
MoisturePlus sales only 


Vendor                         Units   %
Manly Beard Maintenance, Inc.  9785    23%
GruffCustomer.com              20100   46%
Stu's Shaving Supply LLC       8093    19%
Cosmetics for Men, Inc.        5311    12%
Total                          43289   100%
ideas validated


Exercise Solution


What did General American Wholesaler’s vendor list tell you about who’s buying MoisturePlus? 




It looks like men are buying MoisturePlus!
Looking at Acme's vendor list,
you couldn't tell there were men
buying. But General American Wholesalers
is reselling MoisturePlus to shaving supply
vendors!


General American Wholesalers 


confirms your impression 


Yeah, the old guys like it, too, even though 
they're embarrassed that it's a tween
product. It's great for post-shave skin
conditioning. 


This could be huge. 


It looks like there’s a whole group of people out 
there buying MoisturePlus that Acme hasn’t 
recognized. 

With any luck, this group of people could be 
where you have the potential to grow Acme’s sales. 
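The percentages in the General American Wholesalers breakdown can be checked with a few lines of arithmetic. This is a minimal sketch using the unit counts from that breakdown; the helper name `percent_shares` is an assumption made for the example.

```python
# Units of MoisturePlus resold by General American Wholesalers,
# copied from the vendor breakdown for the six months ending 2/2009.
gaw_units = {
    "Manly Beard Maintenance, Inc.": 9785,
    "GruffCustomer.com": 20100,
    "Stu's Shaving Supply LLC": 8093,
    "Cosmetics for Men, Inc.": 5311,
}


def percent_shares(units):
    """Return each vendor's share of the total, rounded to a whole percent."""
    total = sum(units.values())
    return {name: round(100 * n / total) for name, n in units.items()}


shares = percent_shares(gaw_units)
for name, pct in shares.items():
    print(f"{name:32s} {pct:3d}%")
```

Every vendor on the list targets men, so all of the units resold through this wholesaler appear to be going to a market other than tween girls.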


I'm intrigued. This intelligence might bring about
a huge shift in how we do business. Could you just 
walk me through how you came to this conclusion? 
And what should we do with this new information? 




You’ve made it to the final 
stage of this analysis. 

It’s time to write your report. Remember, 
walk your client through your thought 
process in detail. How did you come to 
the insights you’ve achieved? 

Finally, what do you suggest that he do to 
improve his business on the basis of your 
insights? How does this information help
him increase sales?


Sharpen your pencil





How has the mental model changed? 


What evidence led you to your conclusion? 
Do you have any lingering uncertainties? 
communicate with your client


Sharpen your pencil Solution


How did you recap your work, and what do you recommend that 
the CEO do in order to increase sales? 


I started off trying to figure out how to increase sales to tween girls, because we believed that
those girls were MoisturePlus's sole client base. When we discovered that the tween girl market
was saturated, I dug deeper into the data to look for sources of increased sales. In the process, I
changed the mental model. Turns out there are more people than we realized who are enthusiastic
about our product, especially older men. Since this group of customers is quiet about their
enthusiasm for the product, I recommend that we increase our advertising to them dramatically,
selling the same product with a more man-friendly label. This will increase sales.


there are no
Dumb Questions


If I have to get more detailed data to answer my questions, 
how will I know when to stop? Do I need to go as far as 
interviewing customers myself? 

How far to go chasing new and deeper data sources is 
ultimately a question about your own best judgement. In this case, 
you searched until you found a new market segment, and that was 
enough to enable you to generate a compelling new sales strategy. 
We'll talk more about when to stop collecting data in future chapters.

It seems like getting that wrong mental model at the
beginning was devastating to the first analysis I did. 

Yeah, getting that assumption incorrect at the beginning 
doomed your analysis to the wrong answers. That’s why it’s so 
important to make sure that your models are based on the right 
assumptions from the very beginning and be ready to go back and 
refine them as soon as you get data that upsets your assumptions. 



Does analysis ever stop? I’m looking for some finality here.

You certainly can answer big questions in data analysis, but 
you can never know everything. And even if you knew everything 
today, tomorrow would be different. Your recommendation to sell 
to older men might work today, but Acme will always need analysts 
chasing sales. 

Sounds depressing. 

On the contrary! Analysts are like detectives, and there are 
always mysteries to be solved. That's what makes data analysis so 
much fun! Just think of going back, refining your models, and looking 
at the world through your new models as being a fundamental part of 
your job as a data analyst, not an exception to the rule.


Here's what you did 

Here’s one last look at the steps you’ve gone 
through to reach your conclusion about how to 
increase the sales of Acme’s MoisturePlus.


You got clarification and data from the CEO.
You summarized your data in a useful format.
You suggested that increasing ads to tweens might bring sales back in line.

Define

You looked at your areas of uncertainty.
You compared the elements of your summary.

Evaluate

The tween market report changed your mental model.
You collected data about MoisturePlus customers.
You discovered older men among MoisturePlus buyers.

You recommended increasing marketing to older men.
smart analysis smart decisions


Your analysis led your client 
to a brilliant decision 

After he received your report, the CEO quickly
mobilized his marketing team and created a 
SmoothLeather brand moisturizer, which is 
just MoisturePlus under a new name. 

Acme immediately and aggressively marketed 
SmoothLeather to older men. Here’s what 
happened: 


Sales 





Sales took off! Within two months sales figures 
had exceeded the target levels you saw at the 
beginning of the chapter. 

Looks like your analysis paid off! 
2 experiments

Test your theories



Can you show what you believe? 

In a real empirical test? There’s nothing like a good experiment to solve your problems 
and show you the way the world really works. Instead of having to rely exclusively on 
your observational data, a well-executed experiment can often help you make causal 
connections. Strong empirical data will make your analytical judgments all the more 
powerful. 




starbuzz needs you 


It's a coffee recession!

Times are tough, and even Starbuzz Coffee 
has felt the sting. Starbuzz has been the place 
to go for premium gourmet coffee, but in 
the past few months, sales have plummeted.



The Starbuzz CEO has called you in to help 
figure out how to get sales back up. 


The Starbuzz board
meeting is in three months

That’s not a lot of time to pull a turnaround 
plan together, but it must be done. 

We don’t totally know why sales are down, but 
we’re pretty sure the economy has something 
to do with it. Regardless, you need to figure 

out how to get sales back up. 

What would you do for starters? 


Yikes!


From: CEO, Starbuzz 
To: Head First 

Subject: Fwd: Upcoming board meeting 
Did you see this?!? 

From: Chairman of the Board, Starbuzz 
To: CEO 

Subject: Upcoming board meeting 

The board is expecting a complete 
turnaround plan at the next meeting. 
We're sorely disappointed by the sales
decline.

If your plan for getting numbers back
up is insufficient, we'll be forced to
enact our plan, which first involves the
replacement of all high-level staff.

Thanks. 


Sharpen your pencil

Take a look at the following options. Which do you think would
be the best ways to start? Why?

Interview the CEO to figure out
how Starbuzz works as a business.

Interview the Chairman
of the Board.

Do a survey of customers to find
out what they’re thinking.

Pour yourself a tall, steamy mug
of Starbuzz coffee.

Find out how the projected sales
figures were calculated.

Write in the blanks what you think
about each of these options.
random surveys


Sharpen your pencil Solution


Where do you think is the best place to start figuring out how to 
increase Starbuzz sales? 


Interview the CEO to figure out
how Starbuzz works as a business.

Definitely a good place to start. He'll have all
sorts of intelligence about the business.

Interview the Chairman of the
Board.

You're going out on a limb here. Your client is really
the CEO, and going over his head is dicey.

Do a survey of customers to find
out what they’re thinking.

This would also be good. You'll have to get inside
their heads to get them to buy more coffee.

Find out how the projected sales
figures were calculated.

This would be interesting to know, but it's
probably not the first thing you'd look at.

Pour yourself a tall, steamy mug
of Starbuzz coffee.

Starbuzz is awfully tasty. Why not have a cup?


I like the idea of 
looking at our surveys. 
Give them a gander and 
tell me what you see. 




Marketing runs surveys monthly. 


They take a random, representative sample 
of their coffee consumers and ask them a 
bunch of pertinent questions about how they 
feel about the coffee and the coffee-buying 
experience. 


Remember that word!


What people say in surveys does not always fit 
with how they behave in reality, but it never 
hurts to ask people how they feel. 


The Starbuzz Survey 

Here it is: the survey the marketing 
department administers monthly to a large 
sample of Starbuzz customers. 


If you're a Starbuzz customer,
there's a good chance someone will
hand you one of these to fill out.


Starbuzz Survey 

Thank you for filling out our Starbuzz survey! Once you’re finished, 
our manager will be delighted to give you a $10 gift card for use at any 
Starbuzz location. Thank you for coming to Starbuzz! 

Date: January 2009

Starbuzz store #: 04524

Circle the number that corresponds to how you feel about each
statement. 1 means strongly disagree, 5 means strongly agree.
(The circled answer is shown in parentheses.)

“Starbuzz coffee stores are located conveniently for me.”

1 2 3 4 (5)

“My coffee is always served at just the right temperature.”

1 2 3 (4) 5

“Starbuzz employees are courteous and get me my drink quickly.”

1 2 3 4 (5)

“I think Starbuzz coffee is a great value.”

1 (2) 3 4 5

“Starbuzz is my preferred coffee destination.”

1 2 3 4 (5)


A 5 here means you agree strongly.
This customer really prefers Starbuzz.


How would you summarize
this survey data?
method of comparison


Always use the method of comparison 

One of the most fundamental principles 
of analysis and statistics is the method 
of comparison, which states that data is 
interesting only in comparison to other data. 


Starbuzz Survey 


In this case, the marketing department takes 
the average answer for each question and 
compares those averages month by month. 

Each monthly average is useful only when you 
compare it to numbers from other months. 


Statistics are illuminating
only in relation to
other statistics.

[Figure: a stack of completed Starbuzz surveys]


Here’s a summary of marketing surveys for the 6 months ending January 2009. The figures
represent the average score given to each statement by survey respondents from participating 
stores. 



                       Aug-08  Sept-08  Oct-08  Nov-08  Dec-08  Jan-09
Location convenience   4.7     4.6      4.7     4.2     4.8     4.2
Coffee temperature     4.9     4.9      4.7     4.9     4.7     4.9
Courteous employees    3.6     4.1      4.2     3.9     3.5     4.6
Coffee value           4.3     3.9      3.7     3.5     3.0     2.1
Preferred destination  3.9     4.2      3.7     4.3     4.3     3.9

Participating stores   100     101      99      99      101     100


The answers to
the questions
are all averaged
and grouped
into this table.


This number is only useful when
you compare it to these numbers.



Always make comparisons
explicit.

If a statistic seems interesting or useful,
you need to explain why in terms of
how that statistic compares to others.

If you’re not explicit about it, you’re assuming that
your client will make the comparison on their own,
and that’s bad analysis.
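To make the comparison explicit rather than leaving it to the reader, the monthly averages can be compared directly. The sketch below uses the survey averages from the summary table; the helper names `net_change` and `steadily_declining` are hypothetical, chosen only for this illustration.

```python
# Average survey scores for the six months ending January 2009,
# copied from the marketing survey summary.
months = ["Aug-08", "Sept-08", "Oct-08", "Nov-08", "Dec-08", "Jan-09"]
scores = {
    "Location convenience": [4.7, 4.6, 4.7, 4.2, 4.8, 4.2],
    "Coffee temperature": [4.9, 4.9, 4.7, 4.9, 4.7, 4.9],
    "Courteous employees": [3.6, 4.1, 4.2, 3.9, 3.5, 4.6],
    "Coffee value": [4.3, 3.9, 3.7, 3.5, 3.0, 2.1],
    "Preferred destination": [3.9, 4.2, 3.7, 4.3, 4.3, 3.9],
}


def net_change(series):
    """Change from the first month to the last, rounded to one decimal."""
    return round(series[-1] - series[0], 1)


def steadily_declining(series):
    """True when every month is lower than the month before it."""
    return all(later < earlier for earlier, later in zip(series, series[1:]))


for question, series in scores.items():
    flag = "steady decline" if steadily_declining(series) else ""
    print(f"{question:22s} {net_change(series):+.1f} {flag}")

declining = [q for q, s in scores.items() if steadily_declining(s)]
```

A single monthly average says little by itself; the first-to-last change and the month-by-month direction are what turn it into a comparison.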




















































Comparisons are key for
observational data 


The more comparative the analysis 
is, the better. And this is true especially 
in observational studies like the 
analysis of Starbuzz’s marketing data. 

In observational data, you just watch people 
and let them decide what groups they belong 
to, and taking an inventory of observational 
data is often the first step to getting better 
data through experiments. 


Groups of people might be "big
spenders," "tea drinkers," etc.


In experiments, on the other hand, you decide
which groups people go into.


Scholar's Corner


Observational study: A study where the
people being described decide on their
own which groups they belong to.



Look at the survey data on the facing page and compare the averages 
across the months. 


Do you notice any patterns? 


Is there anything that might explain to you why sales are down? 
seek out causes


Exercise Solution


Now you’ve looked closely at the data to figure out what patterns the data contains. 


Do you notice any patterns? 

All the variables except for "Coffee value" bounce around within a narrow range. "Coffee
temperature," for example, has a high score of 4.9 and a low score of 4.7, which isn't much
variation. "Coffee value," on the other hand, shows a pretty significant decline. The January score
is about half of the August score, which could be a big deal.

Is there anything that might explain to you why sales are down? 

It would make sense to say that if people on average think that the coffee isn't a good value for
the money, they'd tend to spend less money at Starbuzz. And because the economy's down, it makes
sense that people have less money and that they'd find Starbuzz to be less of a value.


Could value perception be 
causing the revenue decline? 


Starbuzz Coffee 

Summary of marketing surveys for six months ending January 2009. The figures represent the
average score given to each statement by survey respondents from participating stores. 


According to the data, everything’s going 
along just fine with Starbuzz customers, 
except for one variable: perceived 
Starbuzz coffee value. 

It looks like people might be buying less 
because they don’t think Starbuzz is 
a good bang for the buck. Maybe the 
economy has made people a little more 
cash-strapped, so they’re more sensitive 
to prices. 

Let’s call this theory the “value problem.” 




                       Aug-08  Sept-08  Oct-08  Nov-08  Dec-08  Jan-09
Location convenience   4.7     4.6      4.7     4.2     4.8     4.2
Coffee temperature     4.9     4.9      4.7     4.9     4.7     4.9
Courteous employees    3.6     4.1      4.2     3.9     3.5     4.6
Coffee value           4.3     3.9      3.7     3.5     3.0     2.1
Preferred destination  3.9     4.2      3.7     4.3     4.3     3.9

Participating stores   100     101      99      99      101     100


Coffee value: this variable shows
a steady decline over the
past six months.



Do you think that the decline 
in perceived value is the 
reason for the sales decline? 
there are no
Dumb Questions


How do I know that a decline in 
value actually caused coffee sales to go 
down? 

You don't. But right now the perceived 
value data is the only data you have that is 
congruent with the decline in sales. It looks 
like sales and perceived value are going 
down together, but you don’t know that the 
decline in value has caused the decline in 
sales. Right now, it’s just a theory. 

Could there be other factors at 
play? Maybe the value problem isn’t as 
simple as it looks. 

There almost certainly are other
factors at play. With observational studies,
you should assume that other factors are
confounding your result, because you
can’t control for them as you can with
experiments. More on those buzzwords in a
few pages.

Could it be the other way around? 
Maybe declining sales caused people to 
think the coffee is less valuable. 

That’s a great question, and it could 
definitely be the other way around. A good 
rule of thumb for analysts is, when you’re 
starting to suspect that causes are going in 
one direction (like value perception decline 
causing sales decline), flip the theory around 
and see how it looks (like sales decline 
causes value perception decline). 

So how do I figure out what causes what?


We're going to talk a lot throughout 
this book about how to draw conclusions 
about causes, but for now, you should 
know that observational studies aren’t that 
powerful when it comes to drawing causal 
conclusions. Generally, you'll need other 
tools to get those sorts of conclusions. 

It sounds like observational studies 
kind of suck. 

Not at all! There is a ton of 
observational data out there, and it’d 
be crazy to ignore it because of the 
shortcomings of observational studies. 
What’s really important, however, is that you 
understand the limitations of observational 
studies, so that you don’t draw the wrong 
conclusions about them. 








Your so-called "value problem" is
no problem at all at my stores! Our 
Starbuzz is hugely popular, and no one 
thinks that Starbuzz is a poor value. 
There must be some sort of mistake. 


The manager
of the Starbuzz
SoHo stores.


The manager of the SoHo stores 
does not agree 

SoHo is a wealthy area and the home of a 
bunch of really lucrative Starbuzz stores, 
and the manager of those stores does not 
believe it’s true that there’s a value perception 
problem. Why do you think she’d disagree? 

Are her customers lying? Did someone record 
the data incorrectly? Or is there something 
problematic about the observational study 
itself? 
causal diagram


A typical customer's thinking


Jim: Forget about Starbuzz SoHo. Those guys just
don’t know how to read the numbers, and numbers
don’t lie.

Frank: I wouldn’t be so quick to say that. Sometimes
the instincts of the people on the ground tell you more
than the statistics.

Joe: You’re so right on. In fact, I’m tempted to just
scrap this entire data set. Something seems fishy.

Jim: What specific reason do you have to believe
that this data is flawed?

Joe: I dunno. The fishy smell?

Frank: Look, we need to go back to
our interpretation of the typical or
average customer.

Joe: Fine. Here it is. I drew a picture.

[Causal diagram: a chain that starts at "Economy down" and ends at "Starbuzz sales go down." Annotations: "Everyone's affected by this." and "It's always a good idea to draw a picture of how you think things relate."]

Frank: Is there any reason why this chain of
events wouldn’t apply to people in SoHo?

Jim: Maybe the SoHo folks are not hurting
economically. The people who live there are sickly
rich. And full of themselves, too.

Joe: Hey, my girlfriend lives in SoHo.

Frank: How you persuaded someone from the
fashionable set to date you I have no idea. Jim, you
may be on to something. If you’re doing well money-wise,
you’d be less likely to start believing that Starbuzz
is a poor value.


It looks like the SoHo
Starbuzz customers
may be different
from all the other
Starbuzz customers…
Observational studies
are full of confounders

A confounder is a difference among the 
people in your study other than the factor 
you’re trying to compare that ends up making 
your results less sensible. 


In this case, you’re comparing Starbuzz 
customers to each other at different points 
in time. Starbuzz customers are obviously 
different from each other — they’re different 
people. 

But if they’re different from each other 
in respect to a variable you’re trying to 
understand, the difference is a confounder, and 
in this case the confounder is location. 
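A small worked example can show why a confounder like location matters when scores are averaged across everyone. The respondent counts and the "Everywhere else" score below are entirely hypothetical, invented only to illustrate the arithmetic; they are not Starbuzz figures.

```python
# HYPOTHETICAL numbers for illustration only: (respondents, average
# "coffee value" score). None of these counts come from the survey.
groups = {
    "SoHo": (100, 4.8),
    "Everywhere else": (900, 1.8),
}


def pooled_average(groups):
    """Respondent-weighted average across all groups."""
    total = sum(n for n, _ in groups.values())
    return sum(n * score for n, score in groups.values()) / total


pooled = pooled_average(groups)
print(f"Pooled average: {pooled:.1f}")
for name, (n, score) in groups.items():
    print(f"{name:16s} {score:.1f} (n={n})")
```

Under these assumed numbers, the pooled figure describes neither group well: it hides a satisfied minority inside a much less satisfied majority, which is the kind of distortion a location confounder can produce.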


Here are all of
Starbuzz's customers.

Here are SoHo people.

[Figure: a grid of customer icons with the SoHo customers marked]

The customers that are located in SoHo
might be different from the rest in
a way that confounds our results.


Sharpen your pencil 




Redraw the causal diagram from the facing page to distinguish
between SoHo stores and all the other stores, correcting for the
location confounder.

Assume that the SoHo manager is correct and that SoHo 
customers don’t perceive a value problem. How might that 
phenomenon affect sales? 
darned confounders


How location might be 
confounding your results 


Here’s a refined diagram to show how 
things might be happening. It’s a really 
good idea to make your theories 
visual using diagrams like this, which 


help both you and your clients keep 
track of your ideas. 



[Causal diagram, refined. Legible labels: "Economy down," annotated "Affects everyone."]







What would you do to the data to show whether 
Starbuzz value perception in SoHo is still 
going strong? More generally, what would you 
do to observational study data to keep your 
confounders under control? 
there are no
Dumb Questions


In this case, isn’t it really the wealth of the customers 
rather than the location that confounds the results? 

Sure, and they’re probably related. If you had the data on 
how much money each customer has, or how much money each
customer feels comfortable spending, you could run the analysis 
again to see what sort of results wealth-based grouping gets you. But 
since we don’t have that information, we’re using location. Besides, 
location makes sense, because our theory says that wealthier people 
tend to shop in SoHo. 

Could there be other variables that are confounding this 
data besides location? 

Definitely. Confounding is always a problem with observational 
studies. Your job as the analyst is always to think about how 
confounding might be affecting your results. If you think that the 
effect of confounders is minimal, that’s great, but if there’s reason 
to believe that they’re causing problems, you need to adjust your 
conclusion accordingly. 

What if the confounders are hidden? 

That’s precisely the problem. Your confounders are usually not 
going to scream out to you. You have to dig them up yourself as you 
try to make your analysis as strong as possible. In this case, we are 
fortunate, because the location confounder was actually represented 
in the data, so we can manipulate the data to manage it. Often, the 
confounder information won't be there, which seriously undermines 
the ability of the entire study to give you useful conclusions. 


How far should I go to figure out what the confounders are?

It's really more art than science. You should ask yourself 
commonsense questions about what it is you're studying to imagine 
what variables might be confounding your results. As with everything 
in data analysis and statistics, no matter how fancy your quantitative 
techniques are, it’s always really important that your conclusions 
make sense. If your conclusions make sense, and you’ve 
thoroughly searched for confounders, you've done all you can do for 
observational studies. Other types of studies, as you’ll see, enable 
you to draw some more ambitious conclusions. 

Is it possible that location wouldn’t be a confounder in 
this same data if I were looking at something besides value 
perception? 

Definitely. Remember, location being a confounder makes 
sense in this context, but it might not make sense in another context. 
We have no reason to believe, for example, that people’s feelings 
about whether their coffee temperature is right vary from place to 
place. 

I’m still feeling like observational studies have big 
problems. 

There are big limitations with observational studies. This 
particular study has been useful to you in terms of understanding 
Starbuzz customers better, and when you control for location in the 
data the study will be even more powerful. 
chunk your confounders


Manage confounders by
breaking the data into chunks

To get your observational study confounders under 
control, sometimes it’s a good idea to divide your groups 
into smaller chunks. 


These smaller chunks are more homogeneous. In other
words, they don’t have the internal variation that might 
skew your results and give you the wrong ideas. 

Here is the Starbuzz survey data once again, this time 
with tables to represent other regions. 


This is the original
data summary.


Starbuzz Coffee: All stores 

Summary of marketing surveys for six months ending January 2009. The figures represent the
average score given to each statement by survey respondents from participating stores.

                       Aug-08  Sept-08  Oct-08  Nov-08  Dec-08  Jan-09
Location convenience   4.7     4.6      4.7     4.2     4.8     4.2
Coffee temperature     4.9     4.9      4.7     4.9     4.7     4.9
Courteous employees    3.6     4.1      4.2     3.9     3.5     4.6
Coffee value           4.3     3.9      3.7     3.5     3.0     2.1
Preferred destination  3.9     4.2      3.7     4.3     4.3     3.9



Mid-Atlantic stores only

                       Aug-08  Sept-08  Oct-08  Nov-08  Dec-08  Jan-09
Location convenience   4.9     4.5      4.5     4.1     4.9     4.0
Coffee temperature     4.9     5.0      4.5     4.9     4.5     4.8
Courteous employees    3.5     3.9      4.0     4.0     3.3     4.5
Coffee value           4.0     3.5      2.9     2.6     2.2     0.8
Preferred destination  4.0     4.0      3.8     4.5     4.2     4.1


Seattle stores only

                        Aug-08 Sept-08  Oct-08  Nov-08  Dec-08  Jan-09
Location convenience       4.8     4.5     4.8     4.4     5.0     4.1
Coffee temperature         4.7     4.7     4.8     5.1     4.5     4.9
Courteous employees        3.4     3.9     4.4     4.0     3.5     4.8
Coffee value               4.3     3.8     3.2     2.6     2.1     0.6
Preferred destination      3.9     4.0     3.8     4.4     4.3     3.8


SoHo stores only

                        Aug-08 Sept-08  Oct-08  Nov-08  Dec-08  Jan-09
Location convenience       4.8     4.8     4.8     4.4     4.8     4.0
Coffee temperature         4.8     5.0     4.6     4.9     4.8     5.0
Courteous employees        3.7     4.1     4.4     3.7     3.3     4.8
Coffee value               4.9     4.8     4.8     4.9     4.9     4.8
Preferred destination      3.8     4.2     3.8     4.2     4.1     4.0






These groups are internally homogeneous.
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experiments 


Exercise


Take a look at the data on the facing page, which has been broken 
into groups. 


How much of a difference is there between the Mid-Atlantic store subgroup 
average scores and the average scores for all the Starbuzz stores? 


How does perceived coffee value compare among all the groups? 


Was the SoHo manager right about her customers being happy with the value of Starbuzz 
coffee? 
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split decisions 



Exercise Solution


When you looked at the survey data that had been grouped by 
location, what did you see? 


How much of a difference is there between the Mid-Atlantic store subgroup 
average scores and the average scores for all the Starbuzz stores? 

All the scores wiggle around in the same narrow range, except for the value score. Value
perception just falls off a cliff in the Mid-Atlantic region compared to the all-region average!


How does perceived coffee value compare among all the groups? 

Seattle has a precipitous drop, just like the Mid-Atlantic region. SoHo, on the other hand, appears
to be doing just fine. SoHo's value perception scores beat the all-region average handily. It looks like
the customers in this region are pretty pleased with Starbuzz's value.


Was the SoHo manager right about her customers being happy with the value of Starbuzz 
coffee? 

The data definitely confirm the SoHo manager's beliefs about what her customers think about
Starbuzz's value. It was certainly a good idea to listen to her feedback and look at the data in a
different way because of that feedback.
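If you'd rather not eyeball four tables, the comparison in this exercise can be sketched in a few lines of Python. This is only an illustration, not part of the Starbuzz analysis itself: the scores are the "Coffee value" rows copied from the tables, while the `summarize` helper and the choice to compare the six-month average and the first-to-last change are assumptions made for the sketch.

```python
from statistics import mean

# "Coffee value" scores, Aug-08 through Jan-09, copied from the survey tables.
coffee_value = {
    "All stores":   [4.3, 3.9, 3.7, 3.5, 3.0, 2.1],
    "Mid-Atlantic": [4.0, 3.5, 2.9, 2.6, 2.2, 0.8],
    "Seattle":      [4.3, 3.8, 3.2, 2.6, 2.1, 0.6],
    "SoHo":         [4.9, 4.8, 4.8, 4.9, 4.9, 4.8],
}


def summarize(scores):
    """Return (six-month average, change from the first month to the last)."""
    return round(mean(scores), 2), round(scores[-1] - scores[0], 2)


summary = {region: summarize(scores) for region, scores in coffee_value.items()}

for region, (average, change) in summary.items():
    print(f"{region:<13} average {average:.2f}   change {change:+.1f}")
```

Looking at each chunk separately makes the pattern hard to miss: SoHo barely moves, while the other two regions fall much further than the all-stores row suggests.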
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experiments 


It's worse than we thought!

The big guns have all come out to deal 
with the problems you’ve identified. 



Chief Financial Officer


CFO: This situation is worse than we had anticipated, 
by a long shot. The value perception in our regions other 
than SoHo has absolutely fallen through the floor. 


Marketing: That’s right. The first table, which 
showed all the regions together, actually made the 
value perception look better than it is. SoHo skewed the 
averages upward. 


CFO: When you break out SoHo, where everyone’s 
rich, you can see that SoHo customers are pleased with 
the prices but that everyone else is about to jump ship, if 
they haven’t already. 

Marketing: So we need to figure out what to do. 

CFO: I’ll tell you what to do. Slash prices. 

Marketing: What?!? 

CFO: You heard me. We slash prices. Then people will 
see it as a better value. 

Marketing: I don’t know what planet you’re from, but 
we have a brand to worry about. 

CFO: I come from Planet Business, and we call this 
supply and demand. You might want to go back to 
school to learn what those words mean. Cut prices and
demand goes up. 

Marketing: We might get a jump in sales in the short 
term, but we’ll be sacrificing our profit margins forever 
if we cut prices. We need to figure out a way to persuade
people that Starbuzz is a value and keep prices the same. 

CFO: This is insane. I’m talking economics. Money. 
People respond to incentives. Your fluffy little ideas won’t 
get us out of this 


VP Marketing




Is there anything in the data 
you have that tells you which 
strategy will increase sales? 
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experiment for strategy 


You need an experiment to say 
which strategy will work best 

Look again at that last question on the previous page: 


Is there anything in the data 
you have that tells you which 
strategy will increase sales? 

Observational data by itself can't tell
you what will happen in the future.



You have no observational data that will tell you what 
will happen if you try out what either the VP of 
Marketing or the CFO suggests. 

If you want to draw conclusions about things 
that overlap with your data but aren’t completely 
described in the data, you need theory to make the 
connection. 


These theories might be true or totally
false, but your data doesn't say.


Marketing’s Branding Theory

People respond to persuasion. 



CFO’s Economic Theory 

People respond to price. 


Marketing’s strategy 

Appeal to people’s judgement. Starbuzz 
really is a good value, if you think about it in 
the right way. Persuading people to change 
their beliefs will get sales back up. 


CFO’s strategy 

Slash the cost of coffee. This will cause 
people to perceive more value in Starbuzz 
coffee, which will drive sales back up. 


You have no data to support either of these theories, 
no matter how passionately the others believe in them 
and in the strategies that follow from them. 

In order to get more clarity about which strategy is 
better, you’re going to need to run an experiment. 


You need to experiment with
these strategies in order to
know which will increase sales.
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experiments 



The Starbuzz CEO 
is in a big hurry 

And he’s going to pull the trigger 
whether you’re ready or not! 

Let’s see how his gambit works out. 






cheaper coffee 


Starbuzz drops its prices 

Taking a cue from the CFO, the CEO ordered 
a price drop across the board for the month of 
February. All prices in all Starbuzz stores are 
reduced by $0.25.



Will this change create
a spike in sales? 


How will you know? 


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One month later... 


Great! Our board was upset at lowering 
prices, but look how well it went. Now I 
need to know how much more money we 
made as a result of the move. 


[Chart: average store revenue per day, with the start of the new pricing marked]





It’d be nice to know how much extra Starbuzz earned in February that they wouldn’t 
have earned if prices hadn’t been dropped. Do you think sales would have the data to 
help figure this out? Why or why not? 
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baseline controls 



Does sales have the data that would help you figure out how much more money you 
made off the cheaper $3.75 coffee? 


Sales couldn't have the data. They only have data for $3.75 coffee, and they can't compare that
data to hypothetical data about what kind of revenue $4.00 coffee would have brought.


Control groups give you a baseline 

You have no idea how much extra you made. 

Sales could have skyrocketed relative to what 
they would have been had the CEO not cut
prices. Or they could have plummeted. You 
just don’t know. 

You don’t know because by slashing prices 
across the board the CEO failed to follow the 

method of comparison. Good experiments 
always have a control group that enables the 
analyst to compare what you want to test with 
the status quo. 
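To make the method of comparison concrete, here is a minimal sketch of what a control group buys you. Everything in it is hypothetical: the `estimate_lift` helper and the revenue figures are made up for illustration and are not Starbuzz data. The point is only that the "extra" revenue is a difference between two groups measured over the same period, which is exactly the number you can't compute when every store gets the price cut.

```python
from statistics import mean


def estimate_lift(experimental, control):
    """Difference between the average outcomes of two groups measured at the same time."""
    return mean(experimental) - mean(control)


# Hypothetical daily revenue per store, in dollars (not from the Starbuzz data).
price_cut_stores = [1050, 1100, 1080, 1120]   # experimental group: lower prices
status_quo_stores = [1000, 1020, 990, 1030]   # control group: prices unchanged

lift = estimate_lift(price_cut_stores, status_quo_stores)
print(f"Estimated extra revenue per store per day: ${lift:.2f}")
```

Without the second list there is nothing to subtract, so there is no estimate.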





Scholar's Corner

Control group: A group of treatment
subjects that represent the status quo.




No control group means no comparison.

No comparison means no idea what happened.

There are no Dumb Questions

Can’t we compare February’s sales
with January’s sales?

Sure, and if all you’re interested in is
whether sales in February are higher than
January, you’ll have your answer. But without
a control, the data doesn’t say whether your
price-cutting had anything to do with it.

What about comparing this
February’s sales with last year’s
February’s sales?

In this question and the last one you’re
talking about using historical controls,
where you take past data and treat it as the
control, as opposed to contemporaneous
controls, where your control group has
its experience at the same time as your
experimental group. Historical controls
usually tend to favor the success of whatever
it is you’re trying to test because it’s so hard
to select a control that is really like the group
you're testing. In general, you should be
suspicious of historical controls.

Do you always need a control? Is
there ever a case where you can get by
without one?

A lot of events in the world can’t
be controlled. Say you’re voting in an
election: you can’t elect two candidates
simultaneously, see which one fares better
relative to the other, and then go back and
elect the one that is more successful. That’s
just not how elections work, and it doesn't
mean that you can’t analyze the implications
of one choice over the other. But if you could
run an experiment like that you’d be able to
get a lot more confidence in your choice!

What about medical tests? Say you 
want to try out a new drug and are pretty 
sure it works. Wouldn’t you just be letting 
people be sick or die if you stuck them 
in a control group that didn’t receive 
treatment? 

That’s a good question with a 
legitimate ethical concern. Medical studies 
that lack controls (or use historical controls) 
have very often favored treatments that are 
later shown by contemporaneous controlled 
experiment to have no effect or even be 
harmful. No matter what your feelings are 
about a medical treatment, you don’t really 
know that it’s better than nothing until you do 
the controlled experiment. In the worst case, 
you could end up promoting a treatment that 
actually hurts people. 

Like the practice of bleeding people 
when they were sick? 

Exactly. In fact, some of the first 
controlled experiments in history compared 
medical bleeding against just letting people 
be. Bleeding was a frankly disgusting 
practice that persisted for hundreds of years. 
We know now that it was the wrong thing to 
do because of controlled experiments. 


Do observational studies have 
controls? 

They sure do. Remember the definition 
of observational studies: they’re studies 
where the subjects themselves decide what 
group they're in, rather than having you 
decide it. If you wanted to do a study on 
smoking, for example, you couldn't tell some 
people to be smokers and some people 
not to be smokers. People decide issues 
like smoking on their own, and in this case, 
people who chose to be nonsmokers would 
be the control group of your observational 
study. 

I’ve been in all sorts of situations 
where sales have trended upwards in 
one month because we supposedly 
did something in the previous month. 

And everyone feels good because we 
supposedly did well. But you’re saying 
that we have no idea whether we did 
well? 

Maybe you did. There's definitely 
a place for gut instincts in business, 
and sometimes you can’t do controlled 
experiments and have to rely on 
observational data-based judgements. But 
if you can do an experiment, do it. There’s 
nothing like hard data to supplement your 
judgement and instincts when you make 
decisions. In this case, you don’t have the 
hard data yet, but you have a CEO that 
expects answers. 


The CEO still wants to know how much
extra money the new strategy made…
How will you answer his request?
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communication pitfalls 


Jim: The CEO asked us to figure out how
much money we made in February that we
wouldn’t have made if we hadn’t cut prices.
We need to give the guy an answer. 

Frank: Well, the answer is a problem. We 
have no idea how much extra money we 
made. It could have been a lot, but we could 
have lost money. Basically, we’ve fallen flat 
on our faces. We’re screwed. 

Joe: No way. We can definitely compare the 
revenue to historical controls. It might not be 
statistically perfect, but he’ll be happy. That’s 
all that counts. 

Frank: A happy client is all that counts? 
Sounds like you want us to sacrifice the war 
to win the day. If we give him the wrong 
answers, it’ll eventually come back on us. 

Joe: Whatever. 

Frank: We’re going to have to give it to him 
straight, and it won’t be pretty. 

Jim: Look, we’re actually in good shape 
here. All we have to do is set up a control 
group for March and run the experiment 
again. 

Frank: But the CEO is feeling good about
what happened in February, and that’s 
because he has the wrong idea about what 
happened. We need to disabuse him of that 
good feeling. 

Jim: I think we can get him thinking clearly 
without being downers about it. 
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Not getting fired 101 

Having to deliver bad news is part of being a 
data analyst. But there are a bunch of different 
ways of going about delivering the same 
information. 

Let’s get straight to the point. How do you 
present bad news without getting fired? 


The best data analysts
know the right way
to deliver potentially
upsetting messages.



Option 1: There is
no bad news.

You're right! Sales
are rocking and rolling.
We're up 100%. You're a
genius!


Option 2: The news is
bad, so let’s panic!

We've blown our brains
out. Catastrophic
meltdown. Please don’t
fire me.


Option 3: There’s bad
news, but if we use it
correctly it’s good news.


Which of these approaches
won’t get you fired...

Today?

Tomorrow?

For your next gig?
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a better experiment 


Let's experiment again (for real!)


We’re running the experiment again for the 
month of March. This time, Marketing 
divided the universe of Starbuzz stores into 
control and experimental groups. 

The experimental group consists of stores 
from the Pacific region, and the control group 
consists of stores from the SoHo and Mid- 
Atlantic regions. 


From: CEO, Starbuzz 
To: Head First 

Subject: Need to re-run experiment 


I get the picture. We still have two
months before the board meeting.
Just do what you need to do and get
it right this time.

That was a close one!



Experimental Group: Pacific region, $3.75

Control Group: SoHo and Mid-Atlantic regions, $4.00 (prices stay the same)

One month later...


Things aren’t looking half bad! Your 
experiment might have given you the answer 
you want about the effectiveness of price 
cutting. 


[Chart: daily sales for the experimental and control groups]


Sharpen your pencil 


Look at the design on the facing page and the results above. 
Could any of these variables be confounding your results? 


Culture 


Location 


Coffee temperature 


Weather 
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confounded again 


Sharpen your pencil Solution


Is it possible that these variables are confounding your results? 


Culture

The culture ought to be the same all over.

Coffee temperature

This should be the same everywhere, too.

Location

Location could definitely confound.

Weather

Could be! Weather is part of location.


Confounders also
plague experiments

Just because you’ve stepped out of the world 
of observational studies to do an experiment 
you’re not off the hook with confounders. 

In order for your comparison to be valid, your
groups need to be the same. Otherwise,
you’re comparing apples to oranges!

You're comparing these two, but they're
different in more ways than the treatment.






Split stores into
groups by Region


Confounding Up Close


Your results show your experimental group making more 
revenue. It could be because people spend more when the coffee 
costs less. But since the groups aren’t comparable, it could
be for any number of other reasons. The weather could be keeping
people on the east coast indoors. The economy could have taken off
in the Pacific region. What happened? You’ll never know, because
of confounders.

Avoid confounders by
selecting groups carefully

Just as it was with observational studies, 
avoiding confounders is all about splitting the 
stores into groups correctly. But how do you 
do it? 


All Starbuzz 
Customers 




Sharpen your pencil


Here are four methods for selecting groups. How do you think
each will fare as a method for avoiding confounders? Which one
do you think will work best?


Charge every other customer differently 
as they check out. That way, half of your 
customers are experimental, half are 
control, and location isn’t a confounder. 


Use historical controls, making all the 
stores the control group this month and 
all the stores the experimental group 
next month. 


Randomly assign different stores to 
control and experimental groups. 


Divide big geographic regions into small
ones and randomly assign the micro-regions
to control and experimental
groups.
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meet randomization 


Sharpen your pencil Solution


Which method for selecting groups do you think is best?


Charge every other customer differently
as they check out. That way, half of your
customers are experimental, half are
control, and location isn’t a confounder.

The customers would freak out. Who wants to have to pay
more than the person standing next to them? Customer anger
would confound your results.


Use historical controls, making all the
stores the control group this month and
all the stores the experimental group next
month.

We've already seen why historical controls are a problem. Who
knows what could happen in the different months to throw
off results?


Randomly assign different stores to
control and experimental groups.

This looks kind of promising, but it doesn't quite fit the bill.
People would just go to the cheaper Starbuzz outlet rather than the
control group. Location would still confound.


Divide big geographic regions into small
ones and randomly assign the micro-regions
to control and experimental
groups.

If your regions were big enough that people wouldn't travel to
get cheaper coffee, but small enough to be similar to each other,
you could avoid location confounding. This is the best bet.


Looks like there is something
to this randomization method.
Let’s take a closer look...
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Randomization selects 
similar groups 

Randomly selecting members from your pool 
of subjects is a great way to avoid confounders. 

What ends up happening when you randomly 
assign subjects to groups is this: the factors that 
might otherwise become confounders end up 
getting equal representation among your 
control and experimental groups. 



Experimental 
groups 
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random issues 



Randomness

This week’s interview: 

OMG that was so random! 


Head First: Randomness, thank you for joining us.
You’re evidently a big deal in data analysis and it’s
great to have you.

Randomness: Well, my schedule from one second
to the next is kind of open. I have no real plan per se.
My being here is, well, like the roll of the dice.

Head First: Interesting. So you have no real plan or
vision for how you do things?

Randomness: That’s right. Willy-nilly is how I
roll.

Head First: So why are you useful in experimental
design? Isn’t data analysis all about order and
method?

Randomness: When an analyst uses my power to
select which experimental and control groups people
or stores (or whatever) go into, my black magic makes
the resulting groups the same as each other. I can
even handle hidden confounders, no problem.

Head First: How’s that?

Randomness: Say half of your population is
subject to a hidden confounder, called Factor X.
Scary, right? Factor X could mess up your results big
time. You don’t know what it is, and you don’t have
any data on it. But it’s there, waiting to pounce.

Head First: But that’s always a risk in observational
studies.

Randomness: Sure, but say in your experiment
you use me to divide your population into
experimental and control groups. What’ll happen
is that your two groups will end up both containing
Factor X to roughly the same degree. If half of your overall
population has it, then about half of each of your groups
will have it. That’s the power of randomization.

Head First: So Factor X may still affect your
results, but it’ll affect both groups in the same
way, which means you can have a valid comparison
in terms of whatever it is you’re testing for?

Randomness: Exactly. Randomized control
is the gold standard for experiments. You can do
analysis without it, but if you have it at your disposal
you’re going to do the best work. Randomized
controlled experiments get you as close as you can
get to the holy grail of data analysis: demonstrating
causal relationships.

Head First: You mean that randomized controlled
experiments can prove causal relationships?

Randomness: Well, “proof” is a very, very strong
word. I’d avoid it. But think about what randomized
controlled experiments get you. You’re testing two
groups that are identical in every way except in the
variable you’re testing. If there’s any difference in
the outcome between the groups, how could it be
anything besides that variable?

Head First: So how do I do randomness? Say I
have a spreadsheet list I want to split in half, selecting
the members of the list randomly. How do I do it?

Randomness: Easy. In your spreadsheet program,
create a column called “Random” and type this
formula into the first cell: =RAND(). Copy and paste
the formula for each member of your list. Then sort
your list by your “Random” column. That’s it! You
can then divide your list into your control group
and as many experimental groups as you need, and
you’re good to go!
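The spreadsheet recipe translates almost line for line into code. The sketch below is an assumption-laden illustration rather than anything from the Starbuzz project: the `random_split` and `share_with_factor_x` helpers, the population of 2,000 stores, and the hidden "Factor X" flag are all invented here. It attaches a random number to each member, sorts by it, cuts the list in half, and then checks how evenly Factor X landed in the two groups.

```python
import random


def random_split(members, seed=None):
    """Mimic the spreadsheet recipe: give each member a random number,
    sort by that number, then cut the sorted list in half."""
    rng = random.Random(seed)
    keyed = [(rng.random(), member) for member in members]
    keyed.sort(key=lambda pair: pair[0])  # sort on the random column only
    shuffled = [member for _, member in keyed]
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Hypothetical population: 2,000 stores, half of them subject to hidden "Factor X".
stores = [{"id": i, "factor_x": i % 2 == 0} for i in range(2000)]
control, experimental = random_split(stores, seed=42)


def share_with_factor_x(group):
    """Fraction of a group that carries the hidden confounder."""
    return sum(store["factor_x"] for store in group) / len(group)


print(f"control:      {share_with_factor_x(control):.3f}")
print(f"experimental: {share_with_factor_x(experimental):.3f}")
```

With groups this large, both shares should land close to one half, though rarely exactly on it. The balance is approximate and gets tighter as the groups get bigger.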
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Sharpen your pencil 


It’s time to design your experiment. Now that you understand 
observational and experimental studies, control and 
experimental groups, confounding, and randomization, you 
should be able to design just the experiment to tell you what you 
want to know. 


What are you trying to demonstrate? Why? 


What are your control and experimental groups going to be? 


How will you avoid confounders? 
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design complete 


Sharpen your pencil Solution


You’ve just designed your first randomized controlled experiment. 
Will it work as you had hoped? 


What are you trying to demonstrate? Why? 

The purpose of the experiment is to figure out which will do a better job of increasing sales:
maintaining the status quo, cutting prices, or trying to persuade customers that Starbuzz coffee is a
good value. We're going to run the experiment over the course of one month: March.

What are your control and experimental groups going to be? 

The control group will be stores that are functioning as they always function, with no specials or
anything. One experimental group will consist of stores that have a price drop for March. The other
experimental group will consist of stores where employees try to persuade customers that Starbuzz is
a good value.

How will you avoid confounders? 

By selecting groups carefully. We're going to divide each major Starbuzz region into micro-regions,
and we'll randomly assign members of that pool of micro-regions to the control and experimental
groups. That way, our three groups will be about the same.


What will your results look like? 

It's impossible to know until we run the experiment, but what might happen is that one or both of the
experimental groups shows higher sales than the control group.
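The group-selection step in this design can be sketched as follows. Treat it as one possible way to do the assignment, not the procedure Starbuzz used: the `assign_groups` helper, the 30 micro-region labels, and the seed are hypothetical. The idea is simply to shuffle the pool of micro-regions and deal them out to the three groups in turn, so that nothing about a micro-region influences which group it ends up in.

```python
import random


def assign_groups(units, group_names, seed=None):
    """Shuffle the units, then deal them out to the groups one at a time."""
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)
    assignment = {name: [] for name in group_names}
    for index, unit in enumerate(shuffled):
        assignment[group_names[index % len(group_names)]].append(unit)
    return assignment


# Hypothetical micro-region labels; a real list would come from Starbuzz.
micro_regions = [f"micro-region-{n:02d}" for n in range(1, 31)]
group_names = ["control", "price drop", "value persuasion"]

groups = assign_groups(micro_regions, group_names, seed=7)
for name, members in groups.items():
    print(f"{name:<17} {len(members)} micro-regions")
```

Dealing from a shuffled list keeps the three groups the same size while leaving membership to chance.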
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Your experiment is ready to go 


Before we run it, let’s take one last 
look at the process we’re going 
through to show once and for all 
which strategy is best. 





Analyze the results by
comparing the groups
to each other
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review results 


The results are in

Starbuzz set up your experiment and let it run 
over the course of several weeks. The daily 
revenue levels for the value persuasion group 
immediately went up compared to the other 
two groups, and the revenue for the lower 
prices group actually matched the control. 


Looks like this strategy is the…



This chart is so useful because it makes an 
excellent comparison. You selected identical 
groups and gave them separate treatments, so 
now you can really attribute the differences in 
revenue from these stores to the factors you’re 
testing. 


These are great results! 

Value persuasion appears to result in 
significantly higher sales than either lowering 
prices or doing nothing. It looks like you have 
your answer. 
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Starbuzz has an empirically 
tested sales strategy 

When you started this adventure in experiments, Starbuzz 
was in disarray. You carefully evaluated observational survey 
data and learned more about the business from several bright 
people at Starbuzz, which led you to create a randomized 
controlled experiment. 

That experiment made a powerful comparison, which
showed that persuading people that Starbuzz coffee is a good
value is a more effective way to increase sales than lowering
prices or doing nothing.

3 optimization


Take it to the max



We all want more of something. 

And we’re always trying to figure out how to get it. If the things we want more of (profit,
money, efficiency, speed) can be represented numerically, then chances are, there’s a
tool of data analysis to help us tweak our decision variables, which will help us find the
solution or optimal point where we get the most of what we want. In this chapter, you’ll be 
using one of those tools and the powerful spreadsheet Solver package that implements it. 


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squeaky squeak 


You're now in the bath toy game


You’ve been hired by Bathing Friends 
Unlimited, one of the country’s premier 
manufacturers of rubber duckies and fish
for bath-time entertainment purposes. 
Believe it or not, bath toys are a serious 
and profitable business. 


They want to make more money, and 
they hear that managing their business 
through data analysis is all the rage, so 
they called you! 


The rubber fish is an unconventional
choice, but it's a seller.

Some call it the classic; some say it's
too obvious; but one thing is clear:
the rubber ducky is here to stay.




I’ll give your firm top
consideration as I make 
my toy purchases this 
year. 


You have discerning customers.
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optimization 


Sharpen your pencil 


Here’s an email from your client at Bathing Friends Unlimited, 
describing why they hired you. 


From: Bathing Friends Unlimited 
To: Head First 

Subject: Requested analysis of product mix 
Dear Analyst, 

We’re excited to have you! 

We want to be as profitable as possible, 
and in order to get our profits up, we 
need to make sure we’re making the right 
amount of ducks and the right amount of 
fish. What we need you to help us figure 
out is our ideal product mix: how much of 
each should we manufacture? 

Looking forward to your work. We’ve heard
great things. 

Regards, 

BFU 


Here's what your client
says about what she needs.


What data do you need to solve this problem? 


















what you can and can't control 

Sharpen your pencil Solution



What data do you need to solve this problem? 

First of all, it'd be nice to have data on just how profitable ducks
and fish are. Is one more profitable than the other? But more than that,
it'd be nice to know what other factors constrain the problem.
How much rubber does it take to make these products? And how much time?



Your Data Needs Up Close

Take a closer look at what you need to know. 

You can divide those data needs into two
categories: things you can’t control, and
things you can.

These are things you can't control:

■ How profitable fish are

■ How profitable ducks are

■ How much rubber they have
to make fish

■ How much rubber they have
to make ducks

■ How much time it takes to
make fish

■ How much time it takes to
make ducks

Then there is the basic thing the client wants you to find
out in order to get the profit as high as possible.
Ultimately, the answers to these two questions
are things you can control.

These are things you can control:

■ How many fish to make

■ How many ducks to make

You need the hard 
numbers on what you 
can and can’t control. 
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optimization 


Constraints limit the
variables you control

These considerations are called constraints, 
because they will define the parameters for your 
problem. What you’re ultimately after is profit, 
and finding the right product mix is how you’ll 
determine the right level of profitability for next 
month. 

But your options for product mix will be limited by 
your constraints. 


These are your actual
constraints for
this problem.


Decision variables are
things you can control

Constraints don’t tell you how to maximize profit; 
they only tell you what you can’t do to maximize 
profit. 

Decision variables, on the other hand, are the things 
you can control. You get to choose how many 
ducks and fish will be manufactured, and as long as 
your constraints are met, your job is to choose the 
combination that creates the most profit. 


From: Bathing Friends Unlimited 
To: Head First 

Subject: Potentially useful info 
Dear Analyst, 

Great questions. Re rubber supply: we 
have enough rubber to manufacture 500 
ducks or 400 fish. If we did make 400 fish, 
we wouldn’t have any rubber to make 
ducks, and vice versa. 

We have time to make 400 ducks or 300 
fish. That has to do with the time it takes to 
set the rubber. No matter what the product 
mix is, we can’t make more than 400 ducks 
and 300 fish if we want the product on 
shelves next month. 

Finally, each duck makes us $5 in profit, 

and each fish makes us $4 in profit. Does 
that help? 

Regards, 

BFU 


You get to choose
the values for
each of these.

Better stay within
your constraints.




So, what do you think you do 
with constraints and decision 
variables to figure out how to 
maximize profit? 
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welcome to optimization 


You have an optimization problem 


When you want to get as much (or as little) of 
something as possible, and the way you’ll get 
it is by changing the values of other quantities, 
you have an optimization problem. 

Here you want to maximize profit by changing 
your decision variables: the number of ducks 
and fish you manufacture. 



But to maximize profit, you have to stay within 
your constraints: the manufacture time and 
rubber supply for both toys. 


To solve an optimization problem, you need to 
combine your decision variables, constraints, 
and the thing you want to maximize together 
into an objective function. 
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optimization 


Find your objective with
the objective function


The objective is the thing you want to 
maximize or minimize, and you use the 

objective function to find the optimum 
result. 


Here’s what your objective function looks like, 
if you state it algebraically: 

$$c_1 x_1 + c_2 x_2 = P$$

Each "c" refers to a constraint.

Each "x" refers to a decision variable.

"P" is your objective: it's the thing you want to maximize.


Don’t be scared! All this equation says is that 
you should get the highest P (profit) possible 
by multiplying each decision variable by a 
constraint. 

Your constraints and decision variables in this
equation combine to become the profit of
ducks and fish, and those together form your
objective: the total profit.

Some problems have more complex
objective functions.



duck profit + fish profit = Total Profit

You want your objective to
be as high as you can get it.


All optimization
problems have
constraints and an
objective function.



What specific values do you
think you should use for the
constraints, $c_1$ and $c_2$?
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objective function adventures 


Your objective function


The constraints that you need to put into 
your objective function are the profit for 
each toy. Here’s another way to look at that 
algebraic function: 



The profit you get from selling fish and ducks 
is equal to the profit per duck multiplied by 
the number of ducks plus the profit per fish 
multiplied by the number of fish. 




Here's your client from
Bathing Friends Unlimited.

(profit per duck * count of ducks) + (profit per fish * count of fish) = Profit

Total fish profit.


Now you can start trying out some product 
mixes. You can fill in this equation with the 
values you know represent the profit per item 
along with some hypothetical count amounts. 



($5 profit * 100 ducks) + ($4 profit * 50 fish) = $700

This is what your profit would
be if you decide to make
100 ducks and 50 fish.


This objective function projects a $700 profit
for next month. We'll use the objective function
to try out a number of other product mixes,
too.
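
If you'd rather try product mixes in code than by hand, the objective function translates almost word for word. This is only a sketch in Python, which the book itself does not use; the name `total_profit` is made up for illustration, and the $5 and $4 unit profits come from the client's email.

```python
DUCK_PROFIT = 5  # dollars of profit per duck, from the client's email
FISH_PROFIT = 4  # dollars of profit per fish, from the client's email


def total_profit(ducks, fish):
    """Objective function: (profit per duck * ducks) + (profit per fish * fish)."""
    return DUCK_PROFIT * ducks + FISH_PROFIT * fish


# The hypothetical mix from the text: 100 ducks and 50 fish.
print(total_profit(100, 50))  # 700
```

Changing the two arguments is the code equivalent of trying out a different product mix.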
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optimization 


Show product mixes with
your other constraints

Rubber and time place limits on the count of
fish you can manufacture, and the best way
to start thinking about these constraints is
to envision different hypothetical product
mixes. Let's start with the constraint of time.

Here's what they say about
their time constraint:

We have time to make 400 ducks or 300
fish. That has to do with the time it takes to
set the rubber. No matter what the product
mix is, we can't make more than 400 ducks
and 300 fish if we want the product on
shelves next month.

A hypothetical "Product mix 1" might be
where you manufacture 100 ducks and 200
fish. You can plot the time constraints for that
product mix (and two others) on these bar
graphs.

[Bar chart: ducks produced for product mixes 1, 2, and 3. Bathing Friends Unlimited has time to produce 400 ducks in the next month. This line shows the number of ducks you can produce.]

[Bar chart: fish produced for product mixes 1, 2, and 3. This line shows how many fish you have time to produce.]

Product mix 1 doesn't violate any constraints,
but the other two do: product mix 2 has too
many fish, and product mix 3 has too many
ducks.

Seeing the constraints in this way is progress,
but we need a better visualization. We have yet
more constraints to manage, and it'd be clearer
if we could view them both on a single chart.

How would you visualize the
constraints on hypothetical product
mixes of ducks and fish with one chart?
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make your graphics multivariate 


Plot multiple constraints
on the same chart

We can plot both time constraints on a single chart,
representing each product mix with a dot rather
than a bar. The resulting chart makes it easy to
visualize both time constraints together.

[Bar charts: ducks produced and fish produced for product mixes 1, 2, and 3. Bathing Friends Unlimited has time to produce 400 ducks and 300 fish in the next month.]

We'll also be able to use this chart to visualize the
rubber constraints. In fact, you can place any
number of constraints on this chart and get
an idea of what product mixes are possible.
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optimization 


Your good options are all
in the feasible region

Plotting ducks on a y-axis and fish on an x-axis 
makes it easy to see what product mixes are 
feasible. In fact, the space where product mixes 
are within the constraint lines is called the 

feasible region. 

When you add constraints to your chart, the 
feasible region will change, and you’ll use the 
feasible region to figure out which point is optimal. 
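
Checking whether a dot falls inside the feasible region is just a matter of testing it against each constraint line. Here's a small Python sketch that covers the two time constraints only; the function name and the second and third example mixes are hypothetical, while the 400-duck and 300-fish limits come from the client's email.

```python
MAX_DUCKS = 400  # ducks there is time to make next month
MAX_FISH = 300   # fish there is time to make next month


def within_time_limits(ducks, fish):
    """True when a product mix stays inside both time constraints."""
    return 0 <= ducks <= MAX_DUCKS and 0 <= fish <= MAX_FISH


print(within_time_limits(100, 200))  # product mix 1 from the text: True
print(within_time_limits(100, 350))  # hypothetical mix with too many fish: False
print(within_time_limits(450, 100))  # hypothetical mix with too many ducks: False
```

Each extra constraint you add to the chart becomes one more condition in a check like this.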


This is the
feasible region.


Sharpen your pencil 


[Chart: ducks (y-axis, 0 to 500) against fish (x-axis, 0 to 500), showing the two time constraints.]


Let’s add our other constraint, which states how many fish and 
ducks can be produced given the quantity of rubber they have. 
Bathing Friends Unlimited said: 


Each fish takes a little more
rubber to make than each duck.



Great questions. Re rubber supply: we 
have enough rubber to manufacture 500 
ducks or 400 fish. If we did make 400 fish, 
we wouldn’t have any rubber to make 
ducks, and vice versa. 


You have a fixed supply of rubber, so
the number of ducks you make will
limit the number of fish you can make.

❶ Draw a point representing a product
mix where you make 400 fish. As she
says, if you make 400 fish, you won't
have rubber to make any ducks.

❷ Draw a point representing a product
mix where you make 500 ducks. If you
made 500 ducks, you'd be able to make
zero fish.

❸ Draw a line through the two points.
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visualize constraints


Sharpen your pencil
Solution


How does the new constraint look on your chart?


❶ Draw a point representing a product
mix where you make 400 fish. As she
says, if you make 400 fish, you won't
have rubber to make any ducks.

❷ Draw a point representing a product
mix where you make 500 ducks. If you
made 500 ducks, you'd be able to make
zero fish.

❸ Draw a line through
the two points.


Great questions. Re rubber supply: we 
have enough rubber to manufacture 500 
ducks or 400 fish. If we did make 400 fish, 
we wouldn’t have any rubber to make 
ducks, and vice versa. 
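
The line through those two points can be written down directly. The Python sketch below assumes, as the chart does, that rubber use trades off in a straight line between the two extremes the client described; the function names are made up for illustration.

```python
def ducks_allowed_by_rubber(fish):
    """Most ducks the rubber supply allows once `fish` fish are made.

    The rubber line passes through (0 fish, 500 ducks) and (400 fish, 0 ducks),
    so every fish uses 500 / 400 = 1.25 ducks' worth of rubber.
    """
    return 500 - 1.25 * fish


def within_rubber_limit(ducks, fish):
    """True when a product mix sits on or below the rubber line."""
    return ducks <= ducks_allowed_by_rubber(fish)


print(within_rubber_limit(500, 0))    # all ducks, no fish: True
print(within_rubber_limit(0, 400))    # all fish, no ducks: True
print(within_rubber_limit(400, 300))  # fits the time limits, but not the rubber: False
```

The last line shows why the feasible region shrinks: a mix can satisfy both time constraints and still need more rubber than there is.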


[Chart: ducks against fish with the rubber constraint line added, running from 500 ducks and no fish to 400 fish and no ducks.]
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optimization 


Your new constraint changed
the feasible region

When you added the rubber constraint, you
changed the shape of the feasible region.

Before you added the constraint, you might have
been able to make, say, 400 ducks and 300 fish. But
now your rubber scarcity has ruled out that product
mix as a possibility.

[Chart: the feasible region bounded by the time and rubber constraints. Your potential product mixes all need to be inside here. You can't use duck/fish combinations that exist in the spaces outside it.]


Sharpen your pencil

Here are some possible product mixes.

Are they inside the feasible region?
Draw a dot for each product mix on the chart.

How much profit will the different product mixes create?
Use the equation below to determine the profit for each.

Use your objective function
to determine profit:

($5 profit * count of ducks) + ($4 profit * count of fish) = Profit

Draw where each
product mix goes
on the chart.


300 ducks and 250 fish
Profit:


100 ducks and 200 fish
Profit:


50 ducks and 300 fish
Profit:
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new feasibilities 


Sharpen your pencil
Solution


You just graphed and calculated the profit for three different
product mixes of ducks and fish. What did you find?


300 ducks and 250 fish.

Profit: ($5 profit * 300 ducks) + ($4 profit * 250 fish) = $2,500
Too bad this product mix isn't in the feasible region.


100 ducks and 200 fish.

Profit: ($5 profit * 100 ducks) + ($4 profit * 200 fish) = $1,300
This product mix definitely works.

50 ducks and 300 fish.

Profit: ($5 profit * 50 ducks) + ($4 profit * 300 fish) = $1,450
This product mix works and makes even more money.


Now all you 
have to do is 

try every possible 
product mix and 
see which one has 
the most profit, 
right? 
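
For a problem this small, a computer really could try every whole-number product mix. The Python sketch below does exactly that as an illustration; it is not what a spreadsheet does behind the scenes. The pellet figures (100 per duck, 125 per fish, 50,000 in supply) are the ones in the client's spreadsheet, and the function name is made up.

```python
PELLETS_PER_DUCK = 100
PELLETS_PER_FISH = 125
PELLET_SUPPLY = 50_000


def best_mix_by_brute_force(max_ducks=400, max_fish=300):
    """Try every whole-number mix inside the time limits and keep the most profitable one."""
    best = (0, 0, 0)  # (profit, ducks, fish)
    for ducks in range(max_ducks + 1):
        for fish in range(max_fish + 1):
            pellets = PELLETS_PER_DUCK * ducks + PELLETS_PER_FISH * fish
            if pellets > PELLET_SUPPLY:
                continue  # this mix needs more rubber than there is
            profit = 5 * ducks + 4 * fish
            if profit > best[0]:
                best = (profit, ducks, fish)
    return best


print(best_mix_by_brute_force())  # (2320, 400, 80)
```

That is roughly 120,000 mixes to check, which is trivial for a computer but grows quickly as you add products, so a smarter method is worth having.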
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optimization 


Even in the small space of the 
feasible region there are tons and 
tons of possible product mixes. 
There's no way you're going to get
me to try them all. 


You don’t have to try them all. 

Both Microsoft Excel and OpenOffice
have a handy little function
that makes short work of optimization
problems. Just turn the page to find out
how …
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software crunches your numbers





Your spreadsheet does optimization

Microsoft Excel and OpenOffice both have a handy little
utility called Solver that can make short work of your
optimization problems.

If you plug in the constraints and write the objective
function, Solver does the algebra for you. Take a look at
this spreadsheet, which describes all the information you
received from Bathing Friends Unlimited.


www.headfirstlabs.com/books/hfda/
bathing_friends_unlimited.xls


These cells show a product
mix where you manufacture
100 ducks and fish each.

Here's Solver.

This box shows your
rubber supply.

This box shows your profit.


There are a few simple formulas on this spreadsheet. First,
here are some numbers to quantify your rubber needs.

The bath toys are made out of rubber pellets, and cells
B10:B11 have formulas that calculate how many pellets
you need.

Rubber pellets       Needed per unit   Used
Duck                 100               10000
Fish                 125               12500
Total pellets used                     22500
Pellet supply                          50000

If you make 100
ducks at 100 rubber
pellets per duck, you
use 10,000 pellets.

Second, cell B20 has a formula that multiplies the count of
fish and ducks by the profit for each to get the total profit.


Take a look at Appendix iii if you use OpenOffice
or if Solver isn't on your menu.


Try clicking the Solver button under
the Data tab. What happens?
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optimization 


Sharpen your pencil 


Let’s take a look at the Solver dialogue box and figure out how it 
works with the concepts you’ve learned. 


Draw an arrow from each element to where it goes in the Solver 
dialogue box. 


Decision Variables
The number of fish and ducks to make


Constraints
Rubber and time


Objective
Profit


Where do you think the objective function goes? 
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solver configuration

Sharpen your pencil
Solution


How do the spaces in the Solver dialogue box match up with the
optimization concepts you've learned?


Draw an arrow from each element to where it goes in the Solver
dialogue box.


Decision Variables
The decision variables
are the values you
will change to find
your objective.


Constraints
The constraints go in the constraints
box... no big surprise there!


Objective
Excel calls your objective
the Target Cell.
Your objective function is in
this cell.


Where do you think the objective function goes?

The objective function goes in a cell on the spreadsheet and returns the objective as the result.
The objective that this objective function calculates is the total profit.
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optimization 



Test Drive


Now that you've defined your optimization
model, it's time to plug the elements of
it into Excel and let the Solver do your
number crunching for you.


❶ Set your target cell to point to your
objective function.


❷ Find your decision variables and add
them to the Changing Cells blank.


❸ Add your constraints.


❹ Click Solve!


Here's your rubber
constraint.


Don't forget your
constraints!


[Spreadsheet: Bathing Friends Unlimited, manufacturing plan for December]

Count
Duck    100
Fish    100

Rubber pellets       Needed per unit   Used
Duck                 100               10000
Fish                 125               12500
Total pellets used                     22500
Pellet supply                          50000


What happens when
you click Solve?
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solver does the grunt work
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Solver crunched your optimization
problem in a snap

Nice work. Solver took all of about a 
millisecond to find the solution to your 
optimization problem. If Bathing Friends 
Unlimited wants to maximize its profit, it need 
only manufacture 400 ducks and 80 fish. 


Solver tried out a bunch of
Count values and found the
ones that maximize profit.


What’s more, if you compare Solver’s result 
to the graph you created, you can see that the 
precise point that Solver considers the best is on 
the outer limit of your feasible region. 
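
It's no accident that the best point sits on the edge. When the objective and every constraint are straight lines, as they are here, the best value always turns up at a corner of the feasible region. The Python sketch below uses that fact: it intersects the constraint lines in pairs, keeps the corners that satisfy every constraint, and picks the most profitable one. It is an illustration of the idea, not a description of how Solver works internally, and the function names are made up.

```python
from itertools import combinations

# Each constraint is (a, b, c), meaning a*ducks + b*fish <= c.
CONSTRAINTS = [
    (1, 0, 400),         # time: at most 400 ducks
    (0, 1, 300),         # time: at most 300 fish
    (100, 125, 50_000),  # rubber: pellets per duck and per fish vs. pellet supply
    (-1, 0, 0),          # ducks cannot be negative
    (0, -1, 0),          # fish cannot be negative
]


def corner_points(constraints, tol=1e-9):
    """Intersect every pair of constraint lines and keep the feasible intersections."""
    points = []
    for (a1, b1, c1), (a2, b2, c2) in combinations(constraints, 2):
        det = a1 * b2 - a2 * b1
        if det == 0:
            continue  # parallel lines never cross
        ducks = (c1 * b2 - c2 * b1) / det
        fish = (a1 * c2 - a2 * c1) / det
        if all(a * ducks + b * fish <= c + tol for a, b, c in constraints):
            points.append((ducks, fish))
    return points


def best_corner(constraints, duck_profit=5, fish_profit=4):
    """Return the feasible corner with the highest total profit."""
    return max(
        corner_points(constraints),
        key=lambda p: duck_profit * p[0] + fish_profit * p[1],
    )


ducks, fish = best_corner(CONSTRAINTS)
print(ducks, fish, 5 * ducks + 4 * fish)  # 400.0 80.0 2320.0
```

Only a handful of corners need checking, which is why this kind of problem can be solved so quickly.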



Looks like you're using
all your rubber, too.


[Spreadsheet: Bathing Friends Unlimited, manufacturing plan for December]

Count
Duck    400
Fish    80

Rubber pellets       Needed per unit   Used
Duck                 100               40000
Fish                 125               10000
Total pellets used                     50000
Pellet supply                          50000

Unit profit
Duck    $5
Fish    $4

Total profit    $2,320


Here's your solution.


Here's the profit
you can expect.


Looks like great work. 
Now how did you get to 
that solution again? 


Better explain to the 
client what you’ve 
been up to... 




[Chart: ducks against fish, showing the feasible region.]

optimization 


Sharpen your pencil 


How would you explain to the client what you’re up to? Describe 
each of these visualizations. What do they mean, and what do 
they accomplish? 




[Spreadsheet: Bathing Friends Unlimited manufacturing plan for December, showing Solver's result of 400 ducks and 80 fish.]
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interpret optimization results


Sharpen your pencil
Solution


How did you interpret your findings to your client?


[Spreadsheet: Bathing Friends Unlimited manufacturing plan for December, showing Solver's result.]

This spreadsheet shows the product mix computed by
Excel to be the optimum. Of all possible product mixes,
manufacturing 400 ducks and 80 fish produces the
most profit while staying inside our constraints.


[Chart: the feasible region for duck and fish product mixes.]

The shaded part of this graph shows all the possible
duck/fish product mixes given our constraints, which
are represented by the dashed lines. But this chart
does not point out the solution itself.
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optimization 


Profits fell through the floor 


You just got this note from Bathing
Friends Unlimited about the
results of your analysis …


There are lots of ducks left over!


From: Bathing Friends Unlimited 
To: Head First 

Subject: Results of your “analysis” 

Dear Analyst,

Frankly, we’re shocked. We sold all 80 of
the fish we produced, but we only sold 20
ducks. That means our gross profit is only
$420, which you might realize is way below
the estimate you gave us of $2,320. Clearly,
we wanted something better than this.

We haven’t ever had this sort of experience
before with our duck sales, so for the
moment we’re not blaming you for this until
we can do our own internal evaluation of
what happened. You might want to do your
own analysis, too.

Regards, 

BFU 





This is pretty bad news. The fish sold out, 
but no one’s buying the ducks. Looks like you 
may have made a mistake. 
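
The client's $420 figure is easy to check with the same objective function, fed the units that actually sold instead of the units that were made. A quick Python sketch, with a made-up function name:

```python
def profit(ducks, fish, duck_profit=5, fish_profit=4):
    """Total profit for a given number of ducks and fish."""
    return duck_profit * ducks + fish_profit * fish


projected = profit(ducks=400, fish=80)  # what the model promised
realized = profit(ducks=20, fish=80)    # what actually sold, per the client's email

print(projected)             # 2320
print(realized)              # 420
print(projected - realized)  # 1900
```

The arithmetic in the model was fine. What went wrong is that it quietly assumed every toy manufactured would be sold.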




How does your 
model explain 
this situation? 
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models have limitations 


Your model only describes 
what you put into it 


Your model tells you how to maximize profits 
only under the constraints you specified. 

Your models approximate reality and are never 
perfect, and sometimes their imperfections can 
cause you problems. 


It’s a good idea to keep in mind this cheeky
quote from a famous statistician:


“All models are wrong, but some are useful.”
- George Box


Your analytical tools inevitably simplify reality,
but if your assumptions are accurate and
your data’s good, the tools can be pretty reliable.


Your goal should be to create the most useful
models you can, making the imperfections
of the models unimportant relative to your
analytical objectives.


There's a lot more to
reality than this model.

But does it matter?


So how will I know 
if my model has the 
right assumptions? 
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optimization 


Calibrate your assumptions 
to your analytical objectives 

You can’t specify all your assumptions, but if 
you miss an important one it could ruin your 
analysis. 


You will always be asking yourself how far you 
need to go specifying assumptions. It depends 
on how important your analysis is. 



How far should you
go cataloguing your
assumptions?


How important is your analysis?

If it's not very important: Who cares? Don’t
sweat it. Give it a
moment or two of your
time.

If it's critical: Write down everything
you think you know
and everything you
think you don’t know.



Sharpen your pencil


What assumption do you need to include in order to get 
your optimization model working again? 
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you need demand prediction 

Sharpen your pencil
Solution


Is there an assumption that would help you refine your model?


There's nothing in the current model that says what people will actually buy. The model describes
time, rubber, and profit, but in order for the model to work, people would have to buy everything
we make. But, as we saw, this isn't happening, so we need an assumption about what people will buy.


there are no
Dumb Questions


Q: What if the bad assumption were true, and people would
buy everything we manufactured? Would the optimization
method have worked?

A: Probably. If you can assume that everything you make will sell
out, then maximizing your profitability is going to be largely about
fine-tuning your product mix.

Q: But what if I set up the objective function to figure out
how to maximize the amount of ducks and fish we made overall?
It would seem that, if everything was selling out, we’d want to
figure out how to make more.

A: That’s a good idea, but remember your constraints. Your
contact at Bathing Friends Unlimited said that you were limited in the
amount of fish and ducks you could produce by both time and rubber
supply. Those are your constraints.

Q: Optimization sounds kind of narrow. It’s a tool that you
only use when you have a single number that you want to
maximize and some handy equations that you can use to find
the right value.

A: But you can think of optimization more broadly than that. The
optimizing mentality is all about figuring out what you want and
carefully identifying the constraints that will affect how you are able
to get it. Often, those constraints will be things you can represent
quantitatively, and in that case, an algebraic software tool like Solver
will work well.

Q: So Solver will do my optimizations if my problems can be
represented quantitatively.

A: A lot of quantitative problems can be handled by Solver,
but Solver is a tool that specializes in problems involving linear
programming. There are other types of optimization problems and a
variety of algorithms to solve them. If you’d like to learn more, run a
search on the Internet for operations research.

Q: Should I use optimization to deal with this new question:
will we sell people what they want?

A: Yes, if we can figure out how to incorporate people’s
preferences into our optimization model.
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optimization 



Exercise


Here’s some historical sales data for rubber fish and ducks.
With this information, you might be able to figure out why
no one seemed interested in buying all your ducks.


Load this!
www.headfirstlabs.com/books/hfda/
historical_sales_data.xls


This sales data is for the whole rubber
toy industry, not just BFU, so it's a
good indicator of what people prefer to
buy and when they prefer to buy it.


Do you see any month-to-month patterns?


Here's the most recent month,
when everything went wrong.


Is there a pattern in the sales over time that hints at why
ducks didn’t sell well last month?
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more ducks, less fish 



What do you see when you look at this new data? 



Is there a pattern in the sales over time that hints at why
ducks didn’t sell well last month?

Duck sales and fish sales seem to go in opposite
directions. When one's up, the other's down. Last
month, everyone wanted fish.


There are big drops in sales every January.


Here's a switch, where ducks sell well
and then fish jump ahead.


Here's another switch!
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optimization 




Watch out for negatively
linked variables

We don’t know why rubber duck and fish sales
seem to go in opposite directions from each other,
but it sure looks like they are negatively linked.
More of one means less of the other.


[Chart: monthly sales of fish, ducks, and fish and ducks together, January through December over three years.]

They have an increasing
trend, with holiday season sales spikes,
but always one is ahead of the other.


Sometimes, fish are
down and ducks
are up.


Sometimes, ducks
are down and fish
are up.


But nowhere in
the data are they
both up.


Don’t assume that two variables are
independent of each other. Any
time you create a model, make sure
you specify your assumptions about
how the variables relate to each
other.


What sort of constraint would you add to your
optimization model to account for the negatively
linked fish and duck sales?
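
One way to put a number on "negatively linked" is a correlation coefficient, which comes out below zero when one series tends to be up while the other is down. The Python sketch below computes Pearson's correlation by hand. The sales figures in it are made up purely for illustration and are not the data in the historical sales spreadsheet; the function name is also just for this example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation: near -1 means the two series move in opposite directions."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical monthly unit sales, invented for this example only.
duck_sales = [120, 90, 130, 80, 140, 70]
fish_sales = [60, 95, 55, 100, 50, 110]

print(pearson(duck_sales, fish_sales) < 0)  # True: one is up when the other is down
```

A strongly negative value would support treating duck demand and fish demand as linked rather than independent when you write your constraints.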
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optimization around demand 



Exercise


You need a new constraint that estimates demand for ducks and fish for the month in which 
you hope to sell them. 



Looking at the historical sales data, estimate what you think the highest 
amount of sales for ducks and fish will be next month. Assume also 
that the next month will follow the trend of the months that precede it. 
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What usually happens after December
to bath toy sales?


Which toy do you think will
be on top?


Run the Solver again, adding your estimates as new constraints. For
both ducks and fish, what do you think is the maximum number of
units you could hope to sell?
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optimization 
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a new profit estimate 


Exercise
Solution


You ran your optimization model again to incorporate estimates about rubber duck and fish 
sales. What did you learn? 


Looking at the historical sales data, estimate what you think the highest 
amount of sales for ducks and fish will be next month. Assume that the 
next month will be similar to the months that preceded it. 



We should prepare for a big drop
in January sales, and it looks like
ducks will still be on top.

We probably won't be able to sell
more than 150 ducks.


We probably won't be able
to sell more than 50 fish.


Run the Solver again, adding your estimates as new constraints. For example, if 
you don’t think that more than 50 fish will sell next month, make sure you add a 
constraint that tells Solver not to suggest manufacturing more than 50 fish. 
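
Adding the demand estimates amounts to tightening the upper limits on each toy. The Python sketch below stands in for the Solver run using a simple search over whole-number mixes; it is an illustration, not Solver's own method. The 150-duck and 50-fish caps are the estimates from the exercise solution, the other figures come from the client's spreadsheet, and the function name is made up.

```python
def best_mix(duck_demand, fish_demand):
    """Most profitable whole-number mix under time, rubber, and demand limits."""
    max_ducks = min(400, duck_demand)  # time limit vs. expected demand
    max_fish = min(300, fish_demand)
    best = (0, 0, 0, 0)  # (profit, ducks, fish, pellets used)
    for ducks in range(max_ducks + 1):
        for fish in range(max_fish + 1):
            pellets = 100 * ducks + 125 * fish
            if pellets > 50_000:
                continue  # not enough rubber for this mix
            profit = 5 * ducks + 4 * fish
            if profit > best[0]:
                best = (profit, ducks, fish, pellets)
    return best


print(best_mix(duck_demand=150, fish_demand=50))  # (950, 150, 50, 21250)
```

With these estimates, demand rather than rubber or time becomes the limit that matters: the plan uses 21,250 of the 50,000 pellets available.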
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optimization 


Looks like you won't need
anywhere near all your rubber.
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Here’s what Solver returned: 



Here's your profit estimate
for next month.
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It's not as large as last
month's estimate, but it's
a lot more reasonable!
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plan implemented 


Your new plan is
working like a charm

The new plan is working brilliantly. 
Nearly every duck and fish that comes 
out of their manufacturing operation 
is sold immediately, so they have no 
excess inventory and every reason to 
believe that the profit maximization 
model has them where they need to be.

Not too shabby!






From: Bathing Friends Unlimited 
To: Head First 
Subject: Thank you!!! 

Dear Analyst, 

You gave us exactly what we wanted, and
we really appreciate it. Not only have you
optimized our profit, you’ve made our
operations more intelligent and data-driven.
We’ll definitely use your model for a long 
time to come. Thank you! 

Regards, 

BFU 

P.S. Please accept this little token of our
appreciation, a special Head First edition of
our timeless rubber duck.


Enjoy your duck!
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Good job! One question: the model
works because you got the relationship 
right between duck demand and fish 
demand. But what if that relationship 
changes? What if people start buying them 
together, or not at all? 









optimization 


Your assumptions are based
on an ever-changing reality

All your data is observational, and you don’t know what 
will happen in the future. 

Your model is working now, but it might break 
suddenly. You need to be ready and able to reframe 
your analysis as necessary. This perpetual, iterative 
framework is what analysts do. 


Who knows what tomorrow
could have in store?


If the relationships between your
variables change tomorrow, you'll
need to overhaul your model.


Be ready to change your model!
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4 data visualization


Pictures make you smarter



You need more than a table of numbers. 

Your data is brilliantly complex, with more variables than you can shake a stick at. Mulling 
over mounds and mounds of spreadsheets isn’t just boring; it can actually be a waste 
of your time. A clear, highly multivariate visualization can in a small space show you the 
forest that you’d miss for the trees if you were just looking at spreadsheets all the time. 


this is a new chapter 
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you're in the army now


New Army needs to 
optimize their website 


New Army is an online clothing retailer that 
just ran an experiment to test web layouts. For 
one month, everyone who came to the website 
was randomly served one of these three home 
page designs. 


Here's Home Page #1

[Home page mock-up: New Army, with links for Men’s, Women’s, Children’s, and Pets]


Home Page #2


This is their control, because
it’s the stylesheet they’ve
been using up to now.

They had their experiment designers 
put together a series of tests that 
promise to answer a lot of their 
questions about their website design. 

What they want to do is find the best 
stylesheets to maximize sales and get 
people returning to their website. 


Home Page #3
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data visualization 


The results are in, but the
information designer is out

Now that they have a store of fantastic data from 
a controlled, randomized experiment, they need 
a way to visualize it all together. 

So they hired a fancy information designer 

and asked him to pull together something that 
helped them understand the implications of their 
research. Unfortunately, all did not work out as 
planned. 


We got a lot of crap back from the
information designer we hired. It didn't help
us understand our data at all, so he got the
ax. Can you create data visualizations for us
that help us build a better website?



You’ll need to redesign the visualizations for 
the analysis. It could be hard work, because 
the experiment designers at New Army are 
an exacting bunch and generated a lot of 

solid data. 


But before we start, let’s take a look at the 
rejected designs. We’ll likely learn something 
by knowing what sort of visualizations won’t 
work. 


Let’s take a look at the
rejected designs …
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dubious designs 


The last information designer
submitted these three infographics

The information designer submitted these 
three designs to New Army. Take a look at 
these designs. What are your impressions? 

Can you see why the client might not have
been pleased? 


Keyword clicks... what
does that mean?


The size of the text must
have something to do with the
number of clicks.


[Tag cloud: New Army favorite keyword clicks]


You can make tag clouds like this
for free at http://www.wordle.net.


Looks like this
chart measures how
many visits each
home page got.


It seems that they're
all about the same.
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data visualization 



[Diagram: Typical paths through the New Army website]


What do those
arrows mean?


OK, lots of arrows
on this one.


What data is behind the visualizations?


These visualizations are definitely
flashy, but what's behind them?


“What is the data behind the visualizations?” is
the very first question you should ask when
looking at a new visualization. You care about
the quality of the data and its interpretation,
and you’d hate for a flashy design to get in the
way of your own judgments about the analysis.


What sort of data do you think
is behind these visualizations?
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let’s see that data 


Show the data! 

You can’t tell from these visualizations 
what data is behind them. If you’re the 
client, how could you ever expect to be 
able to make useful judgments with the 
visualizations if they don’t even say clearly 
what data they describe? 

Show the data. Your first job in creating 
good data visualizations is to facilitate 
rigorous thinking and good decision 
making on the part of your clients, and 
good data analysis begins and ends with 
thinking with data. 



V) 


New Army 
favorite keyword 
clicks 


Typical paths through the New Army website 




hese Avaphids -Pi-t a \oi 



dem ’ 七 know wViat's loeWmd 


ou just di 
bem until tells you 




^[y\A these yaphs av-c ^o*t solutions 
the pv-oblcms o-f Now hv 叫 . 


Here are some of New Army’s
data sheets.


New Army’s actual 
data, however, is really 
rich and has all sorts of 
great material for your 
visualizations. 


This is y/iiat d\\ about 




Total page hits by stylesheet
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data visualization 


Here's some unsolicited advice 
from the last designer 


You didn’t ask for it, but it appears that 
you’re getting it anyway: the outgoing 
information designer wants to put in 
his two cents about the project. Maybe 
his perspective will help…


_ thats W 
of him -to say. 


From the looks of the table on
the facing page, it appears
that Dan is correct.


Too much to
visualize it all, huh?


To: Head First
From: Dan’s Dizzying Data Designs
Re: Website design optimization project

Dear Head First,

I want to wish you the best of luck on the New
Army project. I didn’t really want to do it anyway,
so it’s good for someone else to get a chance to
give it a shot.

One word of warning: they have a lot of data.
Too much, in fact. Once you really dig into it,
you’ll know what I mean. I say, give me a nice
little tabular layout, and I’ll make you a pretty
chart with it. But these guys? They have more
data than they know what to do with.

And they will expect you to make visuals of
all of it for them. I just made a few nice charts,
which I understand not everyone liked, but I’ll
tell you they’ve set forward an insurmountable
task. They want to see it all, but there is just too
much.

Dan


Sharpen your pencil 


Dan seems to think that an excess of data is a real problem for 
someone trying to design good data visualizations. Do you think 
that what he is saying is plausible? Why or why not? 
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Too much data is 
never your problem

It’s easy to get scared by looking at a 
lot of data. 



But knowing how to deal with what seems like a 
lot of data is easy, too. 

If you’ve got a lot of data and aren’t sure what 
to do with it, just remember your analytical 
objectives. With these in mind, stay focused on 
the data that speaks to your objectives and ignore 
the rest. 


Some of this stuff is
going to be useful to you.

And some of it won’t
be useful to you.





the more data the better 


Sharpen your pencil

Solution 


Is Dan being reasonable when he says it’s too hard to do good 
visualizations when there is too much data? 


This isn’t very plausible. The whole point of data analysis is to summarize data, and summarizing
tools, like the average of a number, will work regardless of whether you have just a few
data points or millions. And if you have a lot of different data sets to compare to each other,
really great visualizations facilitate this sort of data analysis just like all the other tools.
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data visualization 


Duh. The problem is not too much 
data; the problem is figuring out how 
to make the data visually appealing. 


Oh, really? Do you think it’s 
your job as a data analyst 

to create an aesthetic 
experience for your clients? 


Making the data pretty
isn’t your problem either

If the data visualization solves a client’s
problem, it’s always attractive, whether it’s
something really elaborate and visually
stimulating or whether it’s just a plain ol’
table of numbers.

Making good data visualizations is just like 
making any sort of good data analysis. You 
just need to know where to start. 




What do you think the
client’s looking for?







So how do you use a big pile of data with a 
bunch of different variables to evaluate your 
objectives? Where exactly do you begin? 
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compare well 




[Figure: New Army’s three spreadsheets, one per home page, with columns including UserID, Revenue, TimeOnSite, and Pageviews]


While New Army has more data than 
these three sheets, these sheets have the 
comparisons that will speak directly to what 
they want to know. Let’s try out a comparison 
now... 


Here’s Home Page #2


Data visualization is all about
making the right comparisons

To build good visualizations, first identify the
fundamental comparisons that will address your client’s
objectives. Take a look at their most important spreadsheets:



n 


Home Pay 养 I 


ncircs Home Rage 


㊁ 。 



Think about the
comparisons that fulfill
your client’s objectives.



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data visualization 



This value represents the
goal New Army has for
the average amount of
money each user spends.


How do the results you see compare to their 
goals for revenue and time on site? 


Sharpen your pencil 


Take a look at the statistics that describe the results for Home Page #1.
Plot dots to represent each of the users on the axes below.

Use your spreadsheet’s average formula (AVERAGE) to calculate the average
Revenue and TimeOnSite figures for Home Page #1, and draw those
numbers as horizontal and vertical lines on the chart.
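If you would rather script this step than use a spreadsheet, the averaging can be sketched in a few lines of Python. This is only an illustration: the column names Revenue and TimeOnSite come from the New Army sheets, but the sample rows and the two goal values below are made up and are not New Army’s data.

```python
import csv
import io
from statistics import mean

# Made-up sample rows in the same layout as the New Army sheets.
SAMPLE_CSV = """UserID,Revenue,TimeOnSite
1,30,10
2,50,20
3,70,30
"""

# Hypothetical goals, for illustration only.
GOALS = {"Revenue": 60.0, "TimeOnSite": 15.0}


def column_averages(csv_text, columns):
    """Return the mean of each named column in a CSV string."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return {name: mean(float(row[name]) for row in rows) for name in columns}


def compare_to_goals(averages, goals):
    """Label each average as 'above', 'below', or 'at' its goal."""
    labels = {}
    for name, value in averages.items():
        if value > goals[name]:
            labels[name] = "above"
        elif value < goals[name]:
            labels[name] = "below"
        else:
            labels[name] = "at"
    return labels


averages = column_averages(SAMPLE_CSV, ["Revenue", "TimeOnSite"])
print(averages)
print(compare_to_goals(averages, GOALS))
```

The two averages are the values you would draw as the horizontal and vertical lines on the chart; the goal comparison is the same judgment you make by eye once the lines are drawn.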



This value represents New Army’s goal for the average
number of minutes each user spends on the website.


this! 


www.headfirstlabs.com/books/hfda/
hfda_ch04_home_page1.csv




[Figure: blank chart with Revenue on the vertical axis]
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your scatterplot 


Sharpen your pencil
Solution


How did you visualize the Revenue and 
TimeOnSite variables for Home Page #1 ? 



How do the results you see compare to their 
goals for revenue and time on site? 

On average, the time people spend looking at the website under Home Page #1
is greater than New Army’s goal for that statistic. On the other hand, the
average amount of revenue for each user is less than their goal.
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data visualization 




These charts are just a mess.



Here’s another feature
of great visualizations.



Your visualization is already more
useful than the rejected ones

Now that’s a nice chart, and it’ll definitely be useful 
to your client. It’s an example of a good data 
visualization because it … 

■ Shows the data 


■ Makes a smart comparison 

■ Shows multiple variables 


Sunr\inr\3vy 


Home Page #1 


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o°°o 、 


TimeOnSite 
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So what kind of chart 
is that? And what can 
you actually do with it? 




I 

hits by s-tylcshcci 
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slcffpu*? 

dot— 
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scatterplots for causes 


Use scatterplots to explore causes

Scatterplots are great tools for exploratory data 
analysis, which is the term statisticians use to describe 
looking around in a set of data for hypotheses to test. 

Analysts like to use scatterplots when searching for causal
relationships, where one variable is affecting the other.

As a general rule, the horizontal x-axis of the scatterplot
represents the independent variable (the variable
we imagine to be a cause), and the vertical y-axis of a
scatterplot represents the dependent variable (which we
imagine to be the effect).



You don’t have to prove that the value 
of the independent variable causes 
the value of the dependent variable, 
because after all we’re exploring the 
data. But causes are what you’re 
looking for. 
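One common numeric companion to a scatterplot, offered here as a side sketch rather than something this chapter uses, is the Pearson correlation coefficient: it summarizes how closely the dependent variable moves with the independent one. The function name and sample values below are hypothetical, and the sketch assumes neither variable is constant. Like the scatterplot itself, a strong correlation suggests a hypothesis to test; it does not prove a cause.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between paired values (x: independent, y: dependent)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation / sqrt(spread_x * spread_y)


# Made-up values: minutes on site (x) and revenue (y) for four users.
time_on_site = [1, 2, 3, 4]
revenue = [2, 4, 6, 8]
r = pearson_r(time_on_site, revenue)
print(r)
```

A value near +1 or -1 means the dots fall close to a straight line; a value near 0 means the scatterplot shows no linear pattern.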
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data visualization 





The best visualizations are 
highly multivariate 

A visualization is multivariate if it compares three or 
more variables. And because making good comparisons 
is fundamental to data analysis, making your 
visualizations as multivariate as possible makes it 
most likely that you’ll make the best comparisons. 

And in this case you’ve got a bunch of variables. 






You have multiple variables.




How would you make the scatterplot visualization 
you’ve created more multivariate? 




There’s a lot of opportunity
for comparisons here!






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make it multivariate 



Home Page #1 
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Show more variables by 
looking at charts together 

One way of making your visualization more 
multivariate is just to show a bunch of similar 
scatterplots right next to each other, and here’s an 
example of such a visualization. 

All of your variables are plotted together in 
this format, which enables you to compare a 
huge array of information right in one place. 
Because New Army is really interested in revenue 
comparisons, we can just stick with the charts 
that compare TimeOnSite, Pageviews, and 
ReturnVisits to revenue. 
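The arrangement behind a grid of scatterplots like this can be sketched as data: one panel per combination of home page and metric, each pairing the metric with Revenue and carrying its own averages. The sketch below assumes Python; the user records are made up, and `build_panels` is a hypothetical helper name, not a function from R or any charting library.

```python
from statistics import mean

# Made-up user records per home page (not New Army's data).
VISITS = {
    "Home Page #1": [
        {"Revenue": 40, "TimeOnSite": 25, "Pageviews": 6, "ReturnVisits": 4},
        {"Revenue": 60, "TimeOnSite": 35, "Pageviews": 8, "ReturnVisits": 6},
    ],
    "Home Page #2": [
        {"Revenue": 10, "TimeOnSite": 5, "Pageviews": 2, "ReturnVisits": 1},
        {"Revenue": 20, "TimeOnSite": 9, "Pageviews": 4, "ReturnVisits": 1},
    ],
    "Home Page #3": [
        {"Revenue": 80, "TimeOnSite": 30, "Pageviews": 7, "ReturnVisits": 3},
        {"Revenue": 90, "TimeOnSite": 40, "Pageviews": 7, "ReturnVisits": 5},
    ],
}
METRICS = ["TimeOnSite", "Pageviews", "ReturnVisits"]


def build_panels(visits, metrics, target="Revenue"):
    """Build one scatterplot panel per (page, metric), each plotted against target."""
    panels = {}
    for page, users in visits.items():
        for metric in metrics:
            points = [(user[metric], user[target]) for user in users]
            panels[(page, metric)] = {
                "points": points,                        # one dot per user
                "mean_x": mean(x for x, _ in points),    # vertical average line
                "mean_y": mean(y for _, y in points),    # horizontal average line
            }
    return panels


panels = build_panels(VISITS, METRICS)
print(len(panels))
```

With three pages and three metrics this yields nine panels, and because every panel shares the same structure, any plotting tool can draw them side by side on matching axes.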

Here’s the chart that you created.


This graphic was created
with an open source software
program called R, which you’ll
learn about later in the book.
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data visualization 


Sharpen your pencil 


You’ve just created a pretty complex visualization. Look at it and 
think about what it tells you about the stylesheets that New Army 
decided to test. 


Do you think that this visualization does a good job of showing the data? Why or why not? 


Just looking at the dots, you can see that Home Page #2 has a very different sort of spread 
from the other two stylesheets. What do you think is happening with Home Page #2? 


Which of the three stylesheets do you think does the best job of maximizing the variables 
that New Army cares about? Why? 
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analyze the visualization 


Sharpen your pencil

Solution 


Does the new visualization help you understand the comparative 
performance of the stylesheets? 


Do you think that this visualization does a good job of showing the data? Why or why not? 

Each dot on each of the nine panels represents a single user, so even
though the data points are summarized into averages, you can still see absolutely all of it. Seeing
all the points makes it easy to evaluate the spread, and the average lines make it easy to see how
each stylesheet performs relative to each other and relative to New Army’s goals.


Just looking at the dots, you can see that Home Page #2 has a very different sort of spread 
from the other two stylesheets. What do you think is happening with Home Page #2? 

It looks like Home Page #2 is performing terribly. Compared to the other two stylesheets, Home
Page #2 isn’t bringing in much revenue and also performs poorly on the Time on Site, Pageviews,
and Return Visits figures. Every single user statistic is below New Army’s goals. Home Page #2 is
terrible and should be taken offline immediately!


Which of the three stylesheets do you think does the best job of maximizing the variables 
that New Army cares about? Why? 


Home Page #3 is the best. While #1 performs above average when it comes to metrics besides
Revenue, #3 is way ahead in terms of revenue. When it comes to Return Visits, #1 is ahead; and
they’re neck-and-neck on Pageviews, but people spend more time on the site with #3. It’s great
that #1 gets a lot of return visits, but you can’t argue with #3’s superior revenue.
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data visualization 


there are no
Dumb Questions


Q: What software tool should I use to create this sort of
graphic?

A: Those specific graphs are created in a statistical data analysis
program called R, which you’re going to learn all about later in
the book. But there are a number of charting tools you can use in
statistical programs, and you don’t even have to stop there. You
can use illustration programs like Adobe Illustrator and just draw
visualizations, if you have visual ideas that other software tools don’t
implement.

Q: What about Excel and OpenOffice? They have charting
tools, too.

A: Yes, well, that’s true. They have a limited range of charting
tools you can use, and you can probably figure out a way to create a
chart like this one in your spreadsheet program, but it’s going to be
an uphill battle.

Q: You don’t sound too hot on spreadsheet data
visualizations.

A: Many serious data analysts who use spreadsheets all the time
for basic calculations and lists nevertheless wouldn’t dream of using
spreadsheet charting tools. They can be a real pain: not only is there
a small range of charts you can create in spreadsheet programs, but
often, the programs force you into formatting decisions that you might
not otherwise make. It’s not that you can’t make good data graphics
in spreadsheet programs; it’s just that there’s more trouble in it than
you’d have if you learned how to use a program like R.

Q: So if I’m looking for inspiration on chart types, the
spreadsheet menus aren’t the place to look?

A: No, no, no! If you want inspiration on designs, you should
probably pick up some books by Edward Tufte, who’s the authority on
data visualization by a long shot. His body of work is like a museum
of excellent data visualizations, which he sometimes calls “cognitive
art.”

Q: What about magazines, newspapers, and journal articles?

A: It’s a good idea to become sensitive to data visualization
quality in publications. Some are better than others when it comes
to designing illuminating visualizations, and when you pay attention
to the publications, over time, you’ll get a sense of which ones do a
better job. A good way to start would be to count the variables in a
graphic. If there are three or more variables in a chart, the publication
is more likely to be making intelligent comparisons than if there’s one
variable to a chart.

Q: What should I make of data visualizations that are
complex and artistic but not analytically useful?

A: There’s a lot of enthusiasm and creativity nowadays for
creating new computer-generated visualizations. Some of them
facilitate good analytical thinking about the data, and some of them
are just interesting to look at. There’s absolutely nothing wrong
with what some call data art. Just don’t call it data analysis unless
you can directly use it to achieve a greater understanding of the
underlying data.

Q: So something can be visually interesting without being
analytically illuminating. What about vice versa?

A: That’s your judgment call. But if you have something at stake
in an analysis, and your visualization is illuminating, then it’s hard to
imagine that the graphic wouldn’t be visually interesting!


Let’s see what the 
client thinks... 
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communicate with your client 


The visualization is great, but the
web guru’s not satisfied yet


You just got an email from your client, the web 
guru at New Army, assessing what you created 
for him. Let’s see what he has to say … 



a 

vcaso^ablc 

^ucs-tio^. 



To: Head First
From: New Army Web Guru
Re: My explanation of the data

Your designs are excellent and we’re 
pleased we switched to you from the other 
guy. But tell me something: why does 
Home Page #3 perform so much better 
than the others? 

All this looks really reasonable, but I still 
want to know why we have these results. 
I’ve got two pet theories. First, I think that 
Home Page #3 loads faster, which makes 
the experience of the website more snappy. 
Second, I think that its cooler color palette 
is really relaxing and makes for a good 
shopping experience. What do you think? 



HVs shov**t 

you do y/rtii his vc^ucs-t? 


Looks like your
client has some
ideas of his own
about why the
data looks the
way it looks.


He wants to know about causality. 

Knowing what designs work only takes him so 
far. In order to make his website as powerful 
as possible, he needs some idea of why people 
interact with the different home pages the way 
they do. 

And, since he’s the client, we definitely need to 
address the theories he put forward. 
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ftood visual designs help you 
think about causes 


data visualization 


Your and your client’s preferred model will 
usually fit the data. 


This model you 

-Pavo\ri*tc hypothesis or 
o( 七 he data. 



乙财兄 the model -Pits... thals 

why i 七 sccr^s most plausible h> you. 


But there are always other possibilities 


especially when you are willing to get 
imaginative about the explanations. 
What about other models? 


This model -f its, bool 



shape, bu 七 it *to*tally a 匕 dommoda*tes d3*t3- 


r^odcl 乙扣’七 be tv-uc. 


You need to address alternative causal models
or explanations as you describe your data
visualization. Doing so is a real mark of integrity: it
shows your client that you’re not just showing the version
of the story that you like best, and that you’re thinking
through possible failure points in your theories.
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the boss r s hypotheses 


The experiment designers weigh in

The experiment designers saw the web guru’s 
theories and sent you some of their thoughts. 

Perhaps their input will enable you to evaluate 
the web guru’s hypotheses about why some 
home pages performed better than others. 


To: Head First 

From: New Army experiment designers 
Re: The boss’s ideas 

He thinks that page loads count? That 
could be. We haven’t taken a look at the 
data yet to see for sure. But in our testing, 
#2 was the fastest, followed by #3, and 
then #1. So, sure, he could be right. 

As for the cooler color palette, we kind 
of doubt it. The color palette of Home 
Page #3 is coolest, followed by #2, then 
#1, by the way. There’s research to show 
that people react differently, but none of it 
has really persuaded us. 


Here’s their response to
the hypothesis.


Here’s what the experiment
designers think about
the hypothesis.
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data visualization 
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Hypothesis 2: The relaxing, cool color 
palette of Home Page #3 accounts for why it 
performed best. 
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^Sharpen your pencil 


Let’s take a look at the data to see whether the boss’s hypotheses fit.
Does the data fit either of the hypotheses? 


Hypothesis 1 : The snappy performance of 
snappy web pages accounts for why Home 
Page #3 performed best. 



Po y/cb yu’s 
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Home Page #3 
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Do v/cb ^uvus 
V>yfo*thcscs *th*is data? 


Hypothesis 1: The snappy performance of
snappy web pages accounts for why Home
Page #3 performed best.

This can’t be true, since #3 isn’t the
fastest, according to the experiment
designers. It might be that as a general
rule people prefer faster pages, but page
load speed can’t explain #3’s success in
the context of this experiment.


Hypothesis 2: The relaxing, cool color 
palette of Home Page #3 accounts for why it 
performed best. 

This hypothesis -fi*ts 七 he dd*ba* Home Pa^c 
*} is *the highest — dhd 
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da*ba dor/ 七 prove 七 -the Color palrt*tc 
is the reason that #3 performed so well,
but it fits the hypothesis.
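The check applied to both hypotheses can be written out as a tiny Python sketch. The two rankings are the ones the experiment designers reported; the function name and the rule it encodes (a “more of this attribute wins” hypothesis survives only if the best page also tops the attribute ranking) are an illustrative simplification, not a formal statistical test.

```python
# Rankings reported by the experiment designers, strongest first.
LOAD_SPEED_RANK = ["#2", "#3", "#1"]   # fastest to slowest
COOLNESS_RANK = ["#3", "#2", "#1"]     # coolest palette to least cool
BEST_PAGE = "#3"                       # best performer in the experiment


def hypothesis_fits(attribute_rank, best_page):
    """Return True if the best-performing page also ranks first on the attribute."""
    return attribute_rank[0] == best_page


speed_fits = hypothesis_fits(LOAD_SPEED_RANK, BEST_PAGE)
color_fits = hypothesis_fits(COOLNESS_RANK, BEST_PAGE)
print("Load speed hypothesis fits:", speed_fits)
print("Color palette hypothesis fits:", color_fits)
```

A result of True only means the data has not ruled the hypothesis out; it is not evidence that the attribute caused the result.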


fitting hypotheses 


Sharpen your pencil
Solution


How well did you find the web guru’s hypotheses 
to fit the data? 
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data visualization 


The experiment designers have 
some hypotheses of their own 


They’ve had an opportunity to take 
a look at your scatterplots and sent 
you some of their own thinking 
about what’s going on. These 
people are data junkies, and their 
hypotheses definitely fit. 


Here’s what the experiment
designers want to do next.


Maybe it’s fonts and layout.


Maybe it’s the hierarchy
of the pages.


To: Head First 

From: New Army experiment designers 

Re: We don’t know why Home Page #3 is stronger 

We're delighted to hear that #3 is the best, but we really don’t 
know why. Who knows what people are thinking? But that is 
actually OK: as long as we’re showing improvement on the 
business fundamentals, we don’t need to understand people 
in a deep way. Still, it’s interesting to learn as much as we 
can. 

The stylesheets are really different from each other in many 
ways. So when it comes to isolating individual features that 
might account for the performance differential, it's hard. In the 
future, we'd like to take Home Page #3 and test a bunch of 
subtle permutations. That way, we might learn things like how 
button shape or font choice affect user behavior. 

But we conjecture that there are two factors. First, Home 
Page #3 is really readable. We use fonts and a layout that 
are easy on the eyes. Second, the page hierarchy is flatter. 
You can find pretty much everything in three clicks, when 
for Home Page #1 it takes you more like seven clicks to find 
what you want. Both could be affecting our revenue, but we 
need more testing to say for sure. 


Sharpen your pencil 


On the basis of what you’ve learned, what would you recommend 
to your client that he do regarding his web strategy? 
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happy client 


Sharpen your pencil
Solution


What would you tell your client to do with his website on the
basis of the data you visualized and the explanatory theories you
evaluated?


Stick with Home Page #3 and test for finer-grained elements of the user’s experience, like variable
navigation, style, and content. There are a bunch of different possible explanations for #3’s
performance that should be investigated and visualized, but it’s clear that #3 is the victor here.


The client is pleased with your work 


You created an excellent visualization 
that enabled New Army to quickly and 
simultaneously assess all the variables 
they tested in their experiment. 

And you evaluated that visualization in 
light of a bunch of different hypotheses, 
giving them some excellent ideas about 
what to test for in the future. 



Very cool. I agree with your assessments of
the hypotheses and your recommendation.
I’m implementing Home Page #3 for our
website. Job well done.
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data visualization 


Orders are coming in
from everywhere! 

Because of the new website, traffic is 
greater than ever. Your visualization of 
the experimental results showed what 
they needed to know to spruce up their 
website. 


(s/ow /W 你 y swt y ou as a *tha^k-you. 



Even better, New Army has embarked 
on a continuous program of 
experimentation to fine-tune their new 
design, using your visualization to see 
what works. Nice job! 


Av* 你 Y’s of 七 1 你 v/cbsrtc 

•is v-cally o^P. 
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5 KypotKesls testing 


Say it ain’t so



The world can be tricky to explain. 

And it can be fiendishly difficult when you have to deal with complex, heterogeneous data 
to anticipate future events. This is why analysts don’t just take the obvious explanations 
and assume them to be true: the careful reasoning of data analysis enables you to 
meticulously evaluate a bunch of options so that you can incorporate all the information 
you have into your models. You’re about to learn about falsification, an unintuitive but 
powerful way to do just that. 


this is a new chapter 





your new client electroskinny


Gimme some skin...


You’re with ElectroSkinny, a maker of 
phone skins. Your assignment is to figure 
out whether PodPhone is going to release 
a new phone next month. PodPhone is a 
huge product, and there’s a lot at stake. 


£lct*tvoSk*my>y hifs 七 e 



PodPhone will release a phone at some point 
in the future, and ElectroSkinny needs to start 
manufacturing skins a month before the phone 
is released in order to get in on the first wave of 
phone sales. 

If they don’t have skins ready for a release, their
competitors will beat them to the punch
and sell a lot of skins before ElectroSkinny
can put their own on the market. But if they
manufacture skins and PodPhone isn’t released,
they’ll have wasted money on skins that they
can’t sell until some unknown future date.
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hypothesis testing 


When do we start making
new phone skins? 


The decision of when to start 
manufacturing a new line of skins is a 
big deal. 


Your client, the
ElectroSkinny CEO.



l*f PodPhohC \rclcascs, 
we ouv- sk'ms 


PodPhone releases are always a surprise, so
ElectroSkinny has to figure out when they’re about
to happen. If they can start manufacturing a month
before a PodPhone release, they’re in great shape.
Can you help them?


These are the situations
we want to avoid.


If there’s a delay, but ElectroSkinny hasn’t
started manufacturing, they’re in great shape.


Sharpen your pencil


What sort of data or information would help you get 
started on this analytical problem? 
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you have scant data 


Sharpen your pencil 

Solution 


What do you need to know in order to get started? 


PodPhone wants releases to be a surprise, so they’ll probably take measures to avoid letting
people figure out when those releases happen. We’ll need some sort of insight into how they think
about their releases, and we’ll need to know what kind of information they use in their decision.


PodPhone doesn’t want you
to predict their next move

PodPhone takes surprise seriously: they really 
don’t want you to know what they’re up to. So you 
can’t just look at publicly available data and expect 
an answer of when they’re releasing the PodPhone 
to pop out at you. 


PodPhone knows you’ll see all this information, so they
won’t want any of it to let on their release date.


y/oh 


These data points really aren’t
going to be of much help...


You need to figure out how to compare the data 
you do have with your hypotheses about when 
PodPhone will release their new phone. But first, 
let’s take a look at the key pieces of information 
we do have about PodPhone … 



...unless you’ve got a really smart
way to think about them.
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hypothesis testing 


Here’s everything we know

Here’s what little information ElectroSkinny has been
able to piece together about the release. Some of it is
publicly available, some of it is secret, and some of it
is rumor.


PodPhone has 
invested more in 
the new phone 
than any other 
company ever 
has. 


There is going to 
be a huge increase 
in features 
compared to 
competitor phones. 


CEO of PodPhone
said “No way we’re
launching the new
phone tomorrow.”


There was just a 
big new phone 
released from a 
competitor 


The economy 
and consumer 
spending are both 
up, so it’s a good 
time to sell phones. 


There is a rumor 
that the PodPhone 
CEO said there’d 
be no release for 
a year. 



Internally, we don’t expect a release, because their
product line is really strong. They’ll want to ride out their
success with this line as long as possible. I’m thinking we
should start several months from now...


CEO of ElectroSkinny






Do you think her hypothesis makes sense 
in light of the above evidence we have to 
consider? 
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a hypothesis fits 


ElectroSkinny’s analysis
does fit the data


The CEO has a pretty straightforward account of step-
by-step thinking on the part of PodPhone. Here’s what
she said in a schematic form:


Here’s what the ElectroSkinny CEO thinks
is going to be PodPhone’s thinking.



This model of the world
fits your evidence.


This model or hypothesis fits the evidence, because 
there is nothing in the evidence that proves the model 
wrong. Of course, there is nothing in the evidence that 
strongly supports the model either. 


Nothing here that disagrees
with ElectroSkinny’s hypothesis.


PodPhone has 
invested more in 
the new phone 
than any other 
company ever 
has. 


There is going 
to be a huge 
increase in 
features compared 
to competitor 
phones. 


CEO of PodPhone 
said “No way we’re 
launching the new 
phone tomorrow.” 




There was just a
big new phone
released from a
competitor.


The economy 
and consumer 
spending are both 
up, so it’s a good 
time to sell phones. 


There is a rumor
that the PodPhone 
CEO said there’d 
be no release for 
a year. 


Seems like pretty
solid reasoning…
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hypothesis testing 


ElectroSkinny obtained this
confidential strategy memo

ElectroSkinny watches PodPhone
really closely, and sometimes stuff like
this just falls in your lap.

This strategy memo outlines a
number of the factors that PodPhone
considers when it’s calculating its
release dates. It’s quite a bit more
subtle than the reasoning the
ElectroSkinny CEO imagined they
are using.






PodPhone phone release strategy memo

We want to time our releases to maximize sales and to
beat out our competitors. We have to take into account a
variety of factors to do it.

First, we watch the economy, because an increase in
overall economic performance drives up consumer
spending, while economic decline reduces consumer
spending. And consumer spending is where all phone
sales come from. But we and our competitors are after
the same pot of consumer spending. Every phone we sell
is one they don’t sell, and vice versa.

We don’t want to release a phone when they have
a new phone on the market. We take a bigger bite out
of […] if we release when they have a stale […].

Our suppliers and internal development team place
limits on our ability to drop new phones, too.


Can this memo help you figure out
when a PodPhone will be released?


Sharpen your pencil


Think carefully about how PodPhone thinks the variables mentioned in
the memo relate. Do the pairs below rise and fall together, or do they go
in opposite directions? Write a “+” or “-” in each circle depending on your
answer.


Put a “+” in each circle if the two
variables rise and fall together.


Write a “-” sign if the variables move in opposite directions.
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linked variables 


Sharpen your pencil


When the economy goes up, so does
consumer spending.



In the mind of PodPhone, how are the pairs of variables below 
linked to each other quantitatively? 


If a competitor has a
recent product release,
PodPhone avoids releasing.


Competitor 

Product 

Releases 



PodPhone 

Product 

Releases 





Every phone PodPhone sells is a phone that
their competitor doesn’t sell, and vice versa.


Variables can be negatively 
or positively linked 

When you are looking at data variables, it’s a good 
idea to ask whether they are positively linked, 
where more of one means more of the other (and 
vice versa), or negatively linked, where more of 
one means less of the other. 

On the right are some more of the relationships 
PodPhone sees. How can you use these relationships 
to develop a bigger model of their beliefs, one 
that might predict when they’re going to release their 
new phone? 



Here are a few of the other
relationships that can be read
from PodPhone’s strategy memo.




These are all positively linked.
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hypothesis testing 


Sharpen your pencil




Let’s tie those positive and negative links between 
variables into an integrated model. 

Using the relationships specified on the facing page, 
draw a network that incorporates all of them. 


Tiicsc *two 
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networked causality 

Sharpen your pencil 



How does your model of PodPhone’s worldview look once you’ve 
put it in the form of a network? 


PodPhone seems to be watching the interaction 
of a lot of variables. 


There are a lot of things going on here. 

Consumer Spending 


One of these can really affect all other variables. 
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hypothesis testing 


Causes in the real world are 
networked, not linear 


Linearity is intuitive. A linear explanation 
of the causes for why PodPhone might 
decide to delay their release is simple and 
straightforward. 


PodPhone’s strategy memo suggests that 
their thinking is more complex than this. 


This is way too simple. 



But a careful look at PodPhone’s strategy report 
suggests that their actual thinking, whatever the 
details are, is much more complex and sophisticated 
than a simple linear, step-by-step diagram would 
suggest. PodPhone realizes that they are making 
decisions in the context of an active, volatile, 
interlinked system. 


As an analyst, you need to see beyond simple models 
like this and expect to see causal networks. In 
the real world causes propagate across a network of 
related variables … why should your models be any 
different? 
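One way to make the idea of a causal network concrete is to store each link with a sign and multiply the signs along a chain of variables. The sketch below is only an illustration: the first three links are the ones worked out in the exercises above, the link from consumer spending to PodPhone sales is an assumption added for the example, and the helper names are hypothetical. It also treats links as symmetric and follows a single chain, which is a simplification of a real network where several paths can pull in different directions.

```python
# Signed links: +1 means the variables rise and fall together,
# -1 means they move in opposite directions.
LINKS = {
    ("economy", "consumer spending"): +1,              # from the memo exercise
    ("competitor releases", "PodPhone releases"): -1,  # from the memo exercise
    ("PodPhone sales", "competitor sales"): -1,        # from the memo exercise
    ("consumer spending", "PodPhone sales"): +1,       # assumed for illustration
}


def link_sign(a, b):
    """Return the sign of the direct link between a and b, in either order."""
    if (a, b) in LINKS:
        return LINKS[(a, b)]
    if (b, a) in LINKS:
        return LINKS[(b, a)]
    raise KeyError(f"no direct link between {a!r} and {b!r}")


def path_sign(path):
    """Multiply the link signs along a chain of variables."""
    sign = 1
    for a, b in zip(path, path[1:]):
        sign *= link_sign(a, b)
    return sign


chain = ["economy", "consumer spending", "PodPhone sales", "competitor sales"]
print(path_sign(chain))  # sign of the economy's effect along this one chain
```

A result of -1 says only that, along this particular chain, the first and last variables move in opposite directions; it says nothing about the size of the effect.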


O 


So how do we use that to figure out 
when PodPhone is going to release their 
new phone? What about the data? 
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generate hypotheses 


Hypothesize PodPhone's options 

Sooner or later, PodPhone is going to release a 
new phone. The question is when. 

And different answers to that question are your 
hypotheses for this analysis. Below are a 
few options that specify when a release might 
occur, and picking the right hypothesis is what 
Electro Skinny needs you to do. 


PodPhone has 
invested more in 
the new phone 
than any other 
company ever 
has. 


There is going 
to be a huge 
increase in 
features compared 
to competitor 
phones. 


CEO of PodPhone 
said “No way we’re 
launching the new 
phone tomorrow.” 




Here are a few estimates of 
when the new PodPhone will 
be released. 


You’ll somehow combine your 
hypotheses with this evidence 
and PodPhone’s mental 
model to get your prediction. 

There was just a 
big new phone 
released from a 
competitor. 


The economy 
and consumer 
spending are both 
up, so it's a good 
time to sell phones. 


There is a rumor 
that the PodPhone 
CEO said there’d 
be no release for 
a year. 



H1: 
Release will be tomorrow 


H2: 
Release will be next month 


H3: 
Release will be in six months 


H4: 
Release will be in a year 


Your 

hypotheses 


The hypothesis that we consider 
strongest will determine 
ElectroSkinny’s schedule. 
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hypothesis testing 


You have what you need to 
run a hypothesis test 

Between your understanding of PodPhone’s 
mental model and the evidence, you have 
amassed quite a bit of knowledge about the 
issue that Electro Skinny cares about most: 
when PodPhone is going to release their 
product. 

You just need a method to put all this 
intelligence together and form a solid 
prediction. 


[Figure: PodPhone’s mental model, a network of linked variables (Consumer Spending, Competitor Sales, Competitor Product Releases, PodPhone Sales, PodPhone Product Releases, Internal development activity), leading to your prediction, which is what ElectroSkinny is looking for.] 


But how do we do it? We've already seen 
how complex this problem is... with all that 
complexity how can we possibly pick the 
right hypothesis? 
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falsify your way to the truth 


Falsification is the heart 
of hypothesis testing 

Don’t try to pick the right hypothesis; just 

eliminate the disconfirmed hypotheses. 

This is the method of falsification, which is 
fundamental to hypothesis testing. 


Picking the first hypothesis that seems best is called 
satisficing and looks like this: 


H1: Release will be tomorrow 

H2: Release will be next month 

H3: Release will be in six months 

H4: Release will be in a year 


Satisficing is really simple: it’s picking the first 
option without ruling out the others. On the 
other hand, falsification looks like this: 


Falsification is more reliable. 

It looks like both satisficing and falsification 
get you the same answer, right? They don’t 
always. The big problem with satisficing is 
that when people pick a hypothesis without 
thoroughly analyzing the alternatives, they 
often stick with it even as evidence piles up 
against it. Falsification enables you to have 
a more nimble perspective on your 
hypotheses and avoid a huge cognitive trap. 


This is all that’s left. 



Use falsification in hypothesis 
testing and avoid the danger 
of satisficing. 
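The contrast between satisficing and falsification can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a prescribed procedure: the four hypothesis labels come from the list above, while the mapping from each piece of evidence to the hypotheses it contradicts, and the function names, are hypothetical choices made for the example.

```python
HYPOTHESES = {
    "H1": "Release will be tomorrow",
    "H2": "Release will be next month",
    "H3": "Release will be in six months",
    "H4": "Release will be in a year",
}

# Each piece of evidence is paired with the hypotheses it flatly contradicts.
# The pairing is an illustrative assumption.
EVIDENCE = [
    ("CEO said 'No way we're launching the new phone tomorrow.'", {"H1"}),
    ("The economy and consumer spending are both up.", set()),
]


def satisfice(hypotheses, looks_good):
    """Pick the first hypothesis that looks acceptable and ignore the rest."""
    for name in hypotheses:
        if looks_good(name):
            return name
    return None


def falsify(hypotheses, evidence):
    """Keep every hypothesis that no piece of evidence contradicts."""
    ruled_out = set()
    for _description, contradicted in evidence:
        ruled_out |= contradicted
    return [name for name in hypotheses if name not in ruled_out]


survivors = falsify(HYPOTHESES, EVIDENCE)
print(survivors)
```

Satisficing returns a single answer and stops looking; falsification returns everything that is still standing, which keeps the alternatives in view as new evidence arrives.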
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hypothesis testing 


Sharpen your pencil 

Give falsification a try and cross out any hypotheses that are 
falsified by the evidence below. 



Here are your hypotheses. 


Which ones does your evidence suggest are falsified? 


Here’s your evidence. 



PodPhone has 
invested more in 
the new phone 
than any other 
company ever 
has. 


There is going 
to be a huge 
increase in 
features compared 
to competitor 
phones. 


CEO of PodPhone 
said “No way we’re 
launching the new 
phone tomorrow.” 


There was just a 
big new phone 
released from a 
competitor. 


The economy 
and consumer 
spending are both 
up, so it’s a good 
time to sell phones. 


There is a rumor 
that the PodPhone 
CEO said there’d 
be no release for 
a year. 


Why do you believe that the hypotheses you picked are falsified by the evidence? 
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hypotheses eliminated 




vulcs ou*t 


PodPhone has 
invested more in 
the new phone 
than any other 
company ever 
has. 


There is going 
to be a huge 
increase in 
features compared 
to competitor 
phones. 


CEO of PodPhone 
said “No way we’re 
launching the new 
phone tomorrow.” 


There was just a 
big new phone 
released from a 
competitor. 


This evidence 
rules out H1. 


The economy 
and consumer 
spending are both 
up, so it’s a good 
time to sell phones. 


There is a rumor 
that the PodPhone 
CEO said there’d 
be no release for 
a year. 


Why do you believe that the hypotheses you picked are falsified by the evidence? 

H1 is definitely falsified by the evidence, because the CEO has gone on record saying that there was no way 
it’ll launch tomorrow. The CEO might not be telling the truth, but that would be so weird that we can still rule out H1. 

The hypothesis that the phone gets canceled is falsified because PodPhone has put so much money into the phone. The phone might be delayed, 
but unless the company ceases to exist, it’s hard to imagine they’d cancel the phone. 
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hypothesis testing 


There are no 

Dumb Questions 


Falsification seems like a really 
elaborate way to think about analyzing 
situations. Is it really necessary? 

It's a great way to overcome the 
natural tendency to focus on the wrong 
answer and ignore alternative explanations. 
By forcing you to think in a really formal 
way, you’ll be less likely to make mistakes 
that stem from your ignorance of important 
features of a situation. 

How does this sort of falsification 
relate to statistical hypothesis testing? 

What you might have learned in 
statistics class (or better yet, in Head 
First Statistics) is a method of comparing 
a candidate hypothesis (the “alternate” 
hypothesis) to a baseline hypothesis (the 
“null” hypothesis). The idea is to identify a 
situation that, if true, would make the null 
hypothesis darn near impossible. 

So why aren’t we using that 
method? 

One of the virtues of this approach 
is that it enables you to aggregate 
heterogeneous data of widely varying quality. 
This method is falsification in a very general 
form, which makes it useful for very complex 
problems. But it’s definitely a good idea to 
bone up on “frequentist” hypothesis testing 
described above, because for tests where 
the data fit its parameters, you would not 
want to use anything else. 

I think that if my coworkers saw 
me reasoning like this they’d think I was 
crazy. 

They certainly won’t think you're crazy 
if you catch something really important. 

The aspiration of good data analysts is to 
uncover unintuitive answers to complex 
problems. Would you hire a conventionally 
minded data analyst? If you are really 
interested in learning something new about 
your data, you'll go for the person who thinks 
outside the box! 

It seems like not all hypotheses 
could be falsified definitively. Like 
certain evidence might count against a 
hypothesis without disproving it. 

That’s totally correct. 


Where’s the data in all this? I’d 
expect to see a lot more numbers. 

Data is not just a grid of numbers. 
Falsification in hypothesis testing lets you 
take a more expansive view of “data” and 
aggregate a lot of heterogeneous data. You 
can put virtually any sort of data into the 
falsification framework. 

What’s the difference between 
using falsification to solve a problem and 
using optimization to solve it? 

They’re different tools for different 
contexts. In certain situations, you’ll want to 
break out Solver to tweak your variables until 
you have the optimal values, and in other 
situations, you’ll want to use falsification to 
eliminate possible explanations of your data. 

OK. What if I can’t use falsification 
to eliminate all the hypotheses? 

That’s the $64,000 question! Let’s see 
what we can do... 



o 


Nice work! I definitely know more now 
than I did when I brought you on board. 
But can you do even better than this? 
What about eliminating two more? 
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beyond falsification 


O 



We still have 3 hypotheses 
left. Looks like falsification 
didn’t solve the whole problem. 
So what’s the plan now? 


How do you choose among 
the last three hypotheses? 

You know that it’s a bad idea to pick 
the one that looks like it has the 
most support, and falsification has 
helped you eliminate only two of the 
hypotheses, so what should you do now? 




H2: Release will be next month 

H3: Release will be in six months 

H4: Release will be in a year 


Which one of these will you ultimately consider to be strongest? 
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hypothesis testing 


Sharpen your pencil 


What are the benefits and drawbacks of each hypothesis- 
elimination technique? 


Compare each hypothesis to the evidence and pick the one that 
has the most confirmation. 


Just present all of the hypotheses and let the client decide 
whether to start manufacturing skins. 


Use the evidence to rank hypotheses in the order of which has 
the fewest evidence-based knocks against it. 
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weigh your hypotheses 


Sharpen your pencil 
Solution 




Did you pick a hypothesis elimination technique that you 
like best? 


Compare each hypothesis to the evidence and pick the one that 
has the most confirmation. 

This is dangerous. The problem is the information I have is incomplete. It could be that there 
is something really important that I don’t know. And if that’s true, picking the hypothesis 
based on what I do know will probably give me the wrong answer. 


Just present all of the hypotheses and let the client decide 
whether to start manufacturing skins. 

This is certainly an option, but the problem with it is that I’m not really taking any responsibility 
for the conclusions. In other words, I’m not really acting as a data analyst as much as someone who 
just delivers data. This is the wimpy approach. 


Use the evidence to rank hypotheses in the order of which has 
the fewest evidence-based knocks against it. 

This one is the best. I’ve already used falsification to rule out things that I’m sure can’t be true. 
Now, even though I can’t rule out my remaining hypotheses, I can still use the evidence to see which 
ones are the strongest. 
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hypothesis testing 



Wait a second. By putting the hypothesis that seems 
strongest at the top of the list, don’t we run the risk 
of satisficing and picking the one we like rather than 
the one that’s best supported by the evidence? 


Not if you compare your 
evidence to your hypotheses 
by looking at its diagnosticity. 

Evidence is diagnostic if it helps you 
rank one hypothesis as stronger than 
another, and so our method will be to look 
at each hypothesis in comparison to each 
piece of evidence and each other and see 
which has the strongest support. 

Let’s give it a shot … 


Scholar’s Corner 



Diagnosticity is the ability of evidence to help you 
assess the relative likelihood of the hypotheses 
you’re considering. If evidence is diagnostic, 
it helps you rank your hypotheses. 
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meet diagnosticity 


Diagnosticity helps you find the hypothesis 
with the least disconfirmation 


Evidence and data are diagnostic if they help you 
assess the relative strengths of hypotheses. The tables 
below compare different pieces of evidence with several 
hypotheses. The “+” symbol indicates that the evidence 
supports that hypothesis, while the “-” symbol indicates 
that the evidence counts against the hypothesis. 

In the first table, the evidence is diagnostic. 

This evidence is diagnostic. 


Evidence #1 

H1: + 
H2: ++ 

This evidence counts in favor of H1, but it really counts in favor of H2. 

The weights you assign to these values are rigorous but 
subjective, so use your best judgment. 



In the second table, on the other hand, the evidence is 
not diagnostic. 


This evidence doesn’t disconfirm outright, but 
it leads us to doubt. 


This is not diagnostic. 


Evidence #2 

H1 H2 H3 

+ + + 


It might seem like an otherwise interesting piece of evidence, 
but unless it helps us rank our hypotheses, it’s not of much use. 


When you are hypothesis testing, it’s important to 
identify and seek out diagnostic evidence. Nondiagnostic 
evidence doesn’t get you anywhere. Let’s try looking at the 
diagnosticity of our evidence... 
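The diagnosticity check can be expressed as a small test on a row of plus and minus marks: if every hypothesis gets the same mark, the evidence cannot help rank them. The sketch below is only an illustration. The marks for the first piece of evidence are hypothetical values, the second row mirrors the all-plus table above, and `is_diagnostic` is a name invented for this example.

```python
def is_diagnostic(marks):
    """Evidence is diagnostic when it does not treat every hypothesis alike.

    `marks` maps a hypothesis label to a mark such as "++", "+", or "-".
    """
    return len(set(marks.values())) > 1


# Hypothetical marks: this evidence separates the hypotheses.
evidence_1 = {"H1": "+", "H2": "++", "H3": "-"}

# Same mark for every hypothesis, as in the second table.
evidence_2 = {"H1": "+", "H2": "+", "H3": "+"}

for name, marks in [("Evidence #1", evidence_1), ("Evidence #2", evidence_2)]:
    label = "diagnostic" if is_diagnostic(marks) else "not diagnostic"
    print(f"{name}: {label}")
```

The check only looks at whether the marks differ. How strongly to weight a “++” against a “+” is still a judgment call.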
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hypothesis testing 



Take a close look at your evidence in comparison to each of your hypotheses. Use the plus and 
minus notation to rank hypotheses with diagnosticity. 


1. Say whether each piece of evidence supports or hurts each hypothesis. 


2. Cross out pieces of evidence that aren't diagnostic. 



biggest investment in new phone tech ever. 

There is going to be a huge increase in 
features compared to competitor phones. 

CEO of PodPhone said “No way we’re 
launching the new phone tomorrow.” 

There was just a big new phone released 
from a competitor. 

The economy and consumer spending are 
both up. 

Rumor: PodPhone CEO said there’d be 
no release this year. 
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remove nondiagnostic evidence 


Exercise Solution 


How did you rank your hypotheses? 


1. Say whether each piece of evidence supports or hurts each hypothesis. 


2. Cross out pieces of evidence that aren’t diagnostic. 




These three pieces of evidence are not diagnostic and can 
be ignored from this point onward. 




H2: Release will be next month 

H3: Release will be in six months 

H4: Release will be in a year 


[Crossed out] There is going to be a huge increase in 
features compared to competitor phones. 




There was just a big new phone released 
from a competitor. 

The economy and consumer spending are 
both up. 

Rumor: PodPhone CEO said there’d be 
no release this year. 


PodPhone tries to avoid going head-to-head 
with a competitor’s new phone, as you learned. 


We don’t use this piece of evidence to 
falsify H2 and H3, because it's a rumor. 



+ + 
+ 



+ 



+ 


In six months, the competitor’s new phone 
might have faded in popularity, so it’d 
be time for PodPhone to make a move. 

The economy could be worse in a year 
from now, so a strong economy speaks 
in favor of the release being sooner. 
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hypothesis testing 


You can't rule out all the hypotheses, 
but you can say which is strongest 

While the evidence you have at your disposal doesn’t 
enable you to rule out all hypotheses but one, you 
can take the three remaining and figure out which 
one has the least disconfirmation from the evidence. 

That hypothesis is going to be your best bet until you 
know more. 
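Ranking by least disconfirmation can be sketched by counting the minus marks each hypothesis collects across the diagnostic evidence. The grid below is a hypothetical fill-in for illustration only; the actual marks are a judgment call, and the function names are assumptions made for this example.

```python
# Rows are diagnostic pieces of evidence, columns are the remaining hypotheses.
# All marks here are hypothetical values chosen for the example.
GRID = {
    "Competitor just released a big new phone": {"H2": "-", "H3": "+", "H4": "+"},
    "Economy and consumer spending are both up": {"H2": "+", "H3": "+", "H4": "-"},
    "Rumor: CEO said no release this year": {"H2": "-", "H3": "-", "H4": "+"},
}


def disconfirmation(grid):
    """Count the minus marks that each hypothesis receives."""
    counts = {}
    for marks in grid.values():
        for hypothesis, mark in marks.items():
            counts[hypothesis] = counts.get(hypothesis, 0) + mark.count("-")
    return counts


def rank(grid):
    """Order hypotheses from least to most disconfirmed."""
    counts = disconfirmation(grid)
    return sorted(counts, key=counts.get)


print(disconfirmation(GRID))
print(rank(GRID))
```

Counting knocks against a hypothesis, rather than points in its favor, keeps the focus on disconfirmation. Ties are left in their original order here, and a fuller version might break them using the plus marks.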


When will PodPhone release a new phone? 


Today 



This one’s the weakest. 


This one’s the strongest. 


This one’s in 
the middle. 


I’m completely persuaded by this and have 
decided not to manufacture skins in the short 
term. Hopefully, something will come up that will 
let us figure out for sure whether it’ll happen in 
six months or at some point afterward. 


Looks like your analysis 
had the desired effect. 


Too bad we couldn’t start manufacturing. 
It’d be lucrative to be 
out in front of a PodPhone launch. 
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the plot thickens 


You just got a picture message 


Your coworker saw this crew of PodPhone 
employees at a restaurant just now. 

Everyone’s passing around new phones, 
and although your contact can’t get close 
enough to see one, he thinks it might be the 
one. 


Passing around phones? Everyone’s 
seen mock-ups, but why throw 
a party for mock-ups? 


This is new evidence. 

Better look at your hypothesis grid again. 
You can add this new information to 
your hypothesis test and then run it again. 
Maybe this information will help you 
distinguish among your hypotheses even 
further. 


Why would all these PodPhone 
employees be out having a 
bash at a restaurant? 
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hypothesis testing 


Sharpen your pencil 


Do your hypothesis test again, this time with the new evidence. 


There was just a big new phone released 
from a competitor. 

The economy and consumer spending are 
both up. 

Rumor: PodPhone CEO said there’d be 
no release this year. 




++ 

+ 




do^JY\ Y\t^ 




Add the new evidence to the list. Determine the diagnostic strength of 
the new evidence. 



Does this new evidence change your assessment of whether PodPhone 
is about to announce its new phone (and whether ElectroSkinny should 
start manufacturing)? 
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incorporate new data 

Sharpen your pencil 



Did your new evidence change your ideas about the relative 
strengths of your hypotheses? How? 


There was just a big new phone released 
from a competitor. 

The economy and consumer spending are 
both up. 

There is a rumor that CEO isn’t going to 
release this year at all. 

The development team is seen having a 
huge celebration, holding new phones. 




++ 

+ 




++ + 



This is ^ bi0 - 


o 

❺ 


Add the new evidence to the list. Determine the diagnostic strength of 
the new evidence. 


Does this new evidence change your assessment of whether PodPhone 
is about to announce its new phone (and whether ElectroSkinny should 
start manufacturing)? 


Definitely. It’s kind of hard to imagine that the team would be celebrating and passing around 
copies of the phone if they weren’t going to release it soon. We’ve already ruled out 
a launch tomorrow, and so it’s really looking like H2 is our best hypothesis. 
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hypothesis testing 


It's a launch! 


Your analysis was spot on, and Electro Skinny 
had a line of cool new skins for the new 
model of the PodPhone. 


0 



Thanks to you, we totally saw that launch 
coming and were ready for it with a 
bunch of awesome new skins. What’s more, our 
competitors all thought PodPhone wasn’t going 
to release a new phone, so we were the only 
ones ready and now we’re cleaning up! 



Nice work! 
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You’ll always be collecting new data. 

And you need to make sure that every analysis you do incorporates the data you have 
that’s relevant to your problem. You’ve learned how falsification can be used to deal 
with heterogeneous data sources, but what about straight up probabilities? The 
answer involves an extremely handy analytic tool called Bayes’ rule, which will help you 
incorporate your base rates to uncover not-so-obvious insights with ever-changing data. 


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feeling ok? 


The doctor has disturbing news 

Your eyes are not deceiving you. Your doctor 
has given you a diagnosis of lizard flu. 

The good news is that lizard flu is not 
fatal and, if you have it, you’re in for a full 
recovery after a few weeks of treatment. The 
bad news is that lizard flu is a big pain in 
the butt. You’ll have to miss work, and you 
will have to stay away from your loved ones 
for a few weeks. 



LIZARD FLU TEST RESULTS 


Date: Today 

Name: Head First Data 

Diagnosis: Positive 

Here's some information on lizard flu: 

Lizard flu is a tropical disease first 
observed among lizard researchers in 
South America. 

The disease is highly contagious, and 
affected patients need to be quarantined 
in their homes for no fewer than six 
weeks. 

Patients with lizard flu have been 
known to “taste the air” and 
in some cases have developed 
temporary chromatophores and 
zygodactylous feet. 


Your doctor is convinced that you have it, 
but because you’ve become so handy with 
data, you might want to take a look at the 
test and see just how accurate it is. 







bayesian statistics 


Sharpen your pencil 


A quick web search on the lizard flu diagnostic test has yielded 
this result: an analysis of the test’s accuracy. 









Med-O-Pedia 
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Lizard ilu ctiagnostic test 

Accuracy analysis 

If someone has lizard flu, the probability 
that the test returns positive for it is 90 
percent. 

If someone doesn’t have lizard flu, the 
probability that the test returns positive 
for it is 9 percent. 


This is an interesting statistic. 


90%… looks 
pretty solid. 


In light of this information, what do you think is the probability that 
you have lizard flu? How did you come to your decision? 


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be careful with probabilities 

Sharpen your pencil 



You just looked at some data on the efficacy of the lizard flu 
diagnostic test. What did you decide were the chances that you 
have the disease? 


Lizard flu diagnostic test 

Accuracy analysis 

If someone has lizard flu, the probability 
that the test returns positive for it is 90 
percent. 

If someone doesn’t have lizard flu, the 
probability that the test returns positive 
for it is 9 percent. 


In light of this information, what do you think is the probability that 
you have lizard flu? How did you come to your decision? 

It looks like the chances would be 90% if I had the disease. But not everyone has the disease, as 
the second statistic shows. So I should revise my estimate down a little bit. But it doesn’t seem like 
the answer is going to be exactly 90% - 9% = 81%, because that would be too easy, so, I dunno, maybe 
75%? 

The answer is way 
lower than 75%! 



Watch it! 


75% is the answer that most 
people give to this sort of 
question. And they’re way off. 


Not only is 75% the wrong answer, 
but it’s not anywhere near the right 
answer. And if you started making decisions with 
the idea that there’s a 75% chance you have lizard 
flu, you’d be making an even bigger mistake! 


There is so much at stake 
in getting the answer to 
this question correct. 

We are totally going to get to the 
bottom of this... 
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bayesian statistics 


Let's take the accuracy 
analysis one claim at a time 

There are two different and obviously important 
claims being made about the test: the rate at which 
the test returns “positive” varies depending on 
whether the person has lizard flu or not. 

So let’s imagine two different worlds, one 

where a lot of people have lizard flu and one where 
few people have it, and then look at the claim 
about “positive” scores for people who don’t have 
lizard flu. 


Sharpen your pencil 


Start here. 


Lizard flu diagnostic test 

Accuracy analysis 

If someone has lizard flu, the probability 
that the test returns positive for it is 90 
percent. 

If someone doesn’t have lizard flu, the 
probability that the test returns positive 
for it is 9 percent. 



Let’s really look at 
this statement... 


Take a closer look at the second statement 
and answer the questions below. 


Lizard flu diagnostic test 

Accuracy analysis 


Think really hard 
about this. 



If someone doesn’t have lizard flu, the 
probability that the test returns positive 
for it is 9 percent. 



Scenario 1 

If 90 out of 100 people have it, how many 
people who don’t have it test positive? 


Scenario 2 

If 10 out of 100 people have it, how many 
people who don’t have it test positive? 
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prevalence counts 


Sharpen your pencil 

Solution 


Does the number of people who have the disease affect how 
many people are wrongly told that they test positive? 


Lizard flu diagnostic test 

Accuracy analysis 

If someone doesn’t have lizard flu, the 
probability that the test returns positive 
for it is 9 percent. 



Scenario 1 

If 90 out of 100 people have it, how many 
people who don’t have it test positive? 

This means 10 people don’t have it, so 
we take 9% of 10 people, which is about 1 
person who tests positive but doesn't have it. 


Scenario 2 

If 10 out of 100 people have it, how many 
people who don’t have it test positive? 

This means 90 people don’t have it, 
so we take 9% of 90 people, which is about 8 
people who test positive but don’t have it. 


How common is lizard flu really? 

At least when it comes to situations where people 
who don’t have lizard flu test positively, it seems 
that the prevalence of lizard flu in the general 
population makes a big difference. 

In fact, unless we know how many people 
already have lizard flu, in addition to the 
accuracy analysis of the test, we simply cannot 
figure out how likely it is that you have lizard flu. 
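The effect of prevalence on false positives is quick to check with arithmetic. The sketch below applies the 9 percent false positive rate from the accuracy analysis to the two scenarios; the function name and the fixed population of 100 are choices made for this illustration.

```python
FALSE_POSITIVE_RATE = 0.09  # P(+|~L) from the accuracy analysis


def expected_false_positives(population, infected):
    """Expected number of people without the disease who still test positive."""
    healthy = population - infected
    return healthy * FALSE_POSITIVE_RATE


for infected in (90, 10):
    count = expected_false_positives(100, infected)
    print(f"{infected} of 100 infected: about {count:.1f} false positives")
```

The rate never changes, but the number of people it applies to does, which is why the count of false positives depends on how common the disease is.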
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bayesian statistics 


You've been counting false positives 


In the previous exercise, you counted the 
number of people who falsely got a positive 
result. These cases are called false positives. 

Here’s Scenario 1, where lots of 
people have lizard flu. 



And here’s Scenario 
2, where few people 
have the disease. 

4 


doesn’t 
have it 


The opposite of a false 
positive is a true negative 


9% of the people who 
don’t have it is just 
one false positive. 



false 
positive 


In addition to keeping tabs on false positives, you’ve 
also been thinking about true negatives. True 
negatives are situations where people who don’t have 
the disease get a negative test result. 


9% of the people who 
don’t have it is a 
lot of false positives! 


If you don’t have lizard flu, your test result 
is either a false positive or a true negative. 


If someone doesn’t have lizard flu, the 
probability that the test returns positive 
for it is 9 percent. 



False positive rate 


True negative rate 


If someone doesn't have lizard flu, the probability 
that the test returns negative for it is 91%. 



Sharpen your pencil 



If someone has lizard flu, the probability 
that the test returns positive for it is 90 
percent. 



What term do you think describes this statement, and 
what do you think is the opposite of this statement? 
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meet conditional probability 

Sharpen your pencil 
Solution 


What term would you use to describe the other 
part of the lizard flu diagnostic test? 


Lizard flu diagnostic test 

Accuracy analysis 


If someone has lizard flu, the probability 
that the test returns positive for it is 90 
percent. 



This is the true 
positive rate. 


This is the false 
negative rate: 

If someone has lizard flu, the probability 
that he tests negative for it is 10%. 




All these terms describe 
conditional probabilities 

A conditional probability is the probability 
of some event, given that some other event has 
happened. Assuming that someone tests positive, 
what are the chances that he has lizard flu? 

Here’s how the statements you’ve been using look 
in conditional probability notation: 


P(+|L) = 1 - P(-|L) 

P(+|L) represents the true positive. This is the probability 
of a positive test result, given lizard flu. 

P(-|L) represents the false negative. 


P(+|~L) = 1 - P(-|~L) 

P(+|~L) represents the false positive. This is the probability of a 
positive test result, given that the 
person doesn’t have lizard flu. 

P(-|~L) represents the true negative. 


The tilde symbol (~) means that the statement 
(L) is not true. 


Conditional probability up close 

Let’s take a look at what each symbol means in this statement: 
The probability of lizard flu given a positive test result. 


P(L|+) 

P: probability 

L: lizard flu 

|: given 

+: positive test result 
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You weed to count 



false positives, 
true positives, 
false negatives, and 
true negatives 




Figuring out your probability of having lizard flu is 
all about knowing how many actual people are 
represented by these figures. 



How many actual people fall 
into each of these 
probability groups? 


P(+|~L), the probability that someone tests positive, given that they don’t have lizard flu 
P(+|L), the probability that someone tests positive, given that they do have lizard flu 
P(-|L), the probability that someone tests negative, given that they do have lizard flu 
P(-|~L), the probability that someone tests negative, given that they don’t have lizard flu 


But first you need to know how many people have 
lizard flu. Then you can use these percentages to 
calculate how many people actually fall into these 
categories. 


O 


0 


Yeah, I get it. So 
how many people 
have lizard flu? 





This is the figure you want! 


P(L|+) 


What is the probability 
of lizard flu, given a 
positive test result? 
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base rate follies 


1 percent of people have lizard flu 

Here’s the number you need in order to interpret 
your test. Turns out that 1 percent of the 
population has lizard flu. In human terms, that’s 
quite a lot of people. But as a percentage of the 
overall population, it’s a pretty small number. 

One percent is the base rate. Prior to learning 
anything new about individuals because of 
the test, you know that only 1 percent of the 
population has lizard flu. That’s why base rates 
are also called prior probabilities. 


Center for disease tracking 
is on top of lizard flu 

Study finds that 1 percent of 
national population has lizard flu 

The most recent data, which is current as 
of last week, indicates that 1 percent of 
the national population is infected with 
lizard flu. Although lizard flu is rarely fatal, 
these individuals need to be quarantined 
to prevent others from becoming infected. 



Watch out for the base rate fallacy 



I just thought that the 
90% true positive rate 
meant it’s really likely that
you have it! 


That’s a fallacy! 

Always be on the lookout for base rates. You 
might not have base rate data in every case, but 
if you do have a base rate and don’t use it, you’ll 
fall victim to the base rate fallacy, where 
you ignore your prior data and make the wrong 
decisions because of it. 

In this case, your judgment about the probability 
that you have lizard flu depends entirely on the 
base rate, and because the base rate turns out to 
be 1 percent of people having lizard flu, that 
90 percent true positive rate on the test 
doesn’t seem nearly so insightful. 


Sharpen your pencil 


Calculate the probability that you have lizard flu. Assuming you 
start with 1,000 people, fill in the blanks, dividing them into 
groups according to your base rates and the specs of the test. 


Lizard flu diagnostic test

Accuracy analysis 

If someone has lizard flu, the probability 
that the test returns positive for it is 90 
percent. 

If someone doesn’t have lizard flu, the 
probability that the test returns positive 
for it is 9 percent. 


Remember, 1% of people
have lizard flu.


1,000 people 






The number of 
people who have it 




The number of people 
who don’t have it 




The number who 
test positive 


The number who 
test negative 


The number who 
test positive 


The number who 
test negative 


The probability that
you have it, given that
you tested positive

    =

(# of people who have it and test positive) /
[(# of people who have it and test positive) +
 (# of people who don’t have it and test positive)]
things don’t look so bad


Sharpen your pencil

Solution 


What did you calculate your new probability 
of having lizard flu to be? 


Lizard flu diagnostic test

Accuracy analysis 

If someone has lizard flu, the probability 
that the test returns positive for it is 90 
percent. 

If someone doesn’t have lizard flu, the 
probability that the test returns positive 
for it is 9 percent. 


1,000 people

    The number of people who have it: 10
        The number who test positive: 9
        The number who test negative: 1

    The number of people who don’t have it: 990
        The number who test positive: 89
        The number who test negative: 901

9% of people who’ve tested positively have lizard flu.
91% of people who’ve tested positively don’t have lizard flu.


The probability that you have it, given that you tested positive

    = (# of people who have it and test positive) /
      [(# of people who have it and test positive) + (# of people who don’t have it and test positive)]
    = 9 / (9 + 89)
    ≈ 0.09


There’s a 9% chance I have lizard flu!


Here are
1,000 people.

These are all the true negatives.

Here are the true positives.

Here are the false positives.


You’re either a true positive or a false positive, and
it’s a lot more likely that you’re a false positive.


Your chances of having lizard flu are still pretty low 
keep it simple


Do complex probabilistic thinking
with simple whole numbers 

When you imagined that you were looking at 1,000 
people, you switched from decimal probabilities to whole 
numbers. Because our brains aren’t innately well-equipped 
to process numerical probabilities, converting probabilities 
to whole numbers and then thinking through them is a very 
effective way to avoid making mistakes. 
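
If you would rather let a computer do the counting, the whole-number approach can be sketched in a few lines of Python. This is only an illustration: the function name `natural_frequencies` and its parameter names are made up for this example, and the inputs are the figures given earlier (1,000 people, a 1 percent base rate, a 90 percent true positive rate, and a 9 percent false positive rate).

```python
def natural_frequencies(population, base_rate, true_positive_rate, false_positive_rate):
    """Split a population into whole-number groups (hypothetical helper)."""
    have_it = round(population * base_rate)
    dont_have_it = population - have_it
    # People who have it: they either test positive or test negative.
    true_positives = round(have_it * true_positive_rate)
    false_negatives = have_it - true_positives
    # People who don't have it: same split, using the false positive rate.
    false_positives = round(dont_have_it * false_positive_rate)
    true_negatives = dont_have_it - false_positives
    return {
        "true_positives": true_positives,
        "false_negatives": false_negatives,
        "false_positives": false_positives,
        "true_negatives": true_negatives,
    }


groups = natural_frequencies(1000, 0.01, 0.90, 0.09)
# Among everyone who tests positive, what share actually has lizard flu?
p_flu_given_positive = groups["true_positives"] / (
    groups["true_positives"] + groups["false_positives"]
)
print(groups)
print(round(p_flu_given_positive, 2))
```

Rounding to whole people is a deliberate choice here: it mirrors the pencil-and-paper exercise rather than giving the most exact answer.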




Bayes’ rule manages your base
rates when you get new data 


Believe it or not, you just did a commonsense 
implementation of Bayes’ rule, an incredibly powerful 
statistical formula that enables you to use base rates 
along with your conditional probabilities to estimate new 
conditional probabilities. 

If you wanted to make the same calculation algebraically, 
you could use this monster of a formula: 



This formula will give
you the same result
you just calculated.


P(L|+) = P(L)P(+|L) / [P(L)P(+|L) + P(~L)P(+|~L)]


P(L|+): the probability of lizard flu, given a positive test result
P(L): the base rate (people who have the disease)
P(+|L): the true positive rate
P(~L): the base rate (people who don’t have the disease)
P(+|~L): the false positive rate
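
To see that the algebra and the whole-number count agree, here is a minimal Python sketch of the formula. The function name `bayes_posterior` and its argument names are assumptions made for this example; the numbers are the ones from the first lizard flu test.

```python
def bayes_posterior(prior, p_pos_given_flu, p_pos_given_no_flu):
    """P(L|+) from the base rate P(L), P(+|L), and P(+|~L) (hypothetical helper)."""
    numerator = prior * p_pos_given_flu
    # P(~L) is simply 1 - P(L).
    denominator = numerator + (1 - prior) * p_pos_given_no_flu
    return numerator / denominator


# Base rate 1%, true positive rate 90%, false positive rate 9%.
p = bayes_posterior(0.01, 0.90, 0.09)
print(round(p, 2))
```

The result comes out at roughly 9 percent, in line with the count of 9 true positives out of 98 positive tests.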
You can use Bayes’ rule
over and over


Bayes’ rule is an important tool in data analysis,
because it provides a precise way of incorporating new 
information into your analyses. 





Bayes’ rule lets you
add more information.


So the test isn’t that 
accurate. You’re still nine times
more likely to have lizard flu 
than other people. Shouldn’t you 
get another test? 

The base rate: 


Yep, you’re 9x more likely
to have lizard flu than
the regular population.


Your doctor took the suggestion
and ordered another test. Let’s
see what it said…
a new test result


Your second test result is negative 


The doctor didn’t order you the more 
powerful, advanced lizard flu test the first 
time because it’s kind of expensive, but 
now that you tested positively on the 
first (cheaper, less accurate) test, it’s time 
to bring out the big guns … 


The doctor ordered a slightly
different “advanced”
lizard flu diagnostic test.


ADVANCED LIZARD FLU TEST
RESULTS

Date: Today

Name: Head First Data Analyst

Diagnosis: Negative

Here’s some information on lizard flu:

Lizard flu is a tropical disease first
observed among lizard researchers in
South America.

The disease is highly contagious, and
affected patients need to be quarantined
in their homes for no fewer than six
weeks.

Patients diagnosed with lizard flu have
been known to “taste the air” and
in extreme cases have developed
temporary chromatophores and
zygodactylous feet.


That’s a relief!





You got these probabilities
wrong before.

Better run the numbers again. By
now, you know that responding
to the test result (or even the test
accuracy statistics) without looking at base
rates is a recipe for confusion.


The new test has different 
accuracy statistics 


Using your base rate, you can use the new 
test’s statistics to calculate the new probability 
that you have lizard flu. 


This is the test you took.


Lizard flu diagnostic test

Accuracy analysis 

If someone has lizard flu, the probability 
that the test returns positive for it is 90 
percent. 

If someone doesn’t have lizard flu, the 
probability that the test returns positive 
for it is 9 percent. 



Should we use the same 
base rate as before? You tested 
positive. It seems like that should 
count for something. 



This new test is more
expensive but more powerful.


Advanced

Lizard flu diagnostic test

Accuracy analysis

If someone has lizard flu, the probability that
the test returns positive for it is 99 percent.

If someone doesn’t have lizard flu, the 
probability that the test returns positive for it is 
1 percent. 



These accuracy statistics
are a lot stronger.


Sharpen your pencil 


What do you think the base rate should be? 
base rate revision


Sharpen your pencil

Solution 


What do you think the base rate should be? 

1% can’t be the base rate. The base rate is the 9% we just calculated,
because that figure is my own probability of having the disease.


New information can
change your base rate

When you got your first test results 
back, you used as your base rate 
the incidence in the population of 
everybody for lizard flu. 


1 % of everybody 
has lizard flu 


Old base rate

You used to be part of this group.



But you learned from the test that your 
probability of having lizard flu is higher 
than the base rate. That probability is 
your new base rate, because now you’re 
part of the group of people who’ve 
tested positively. 


Now you’re part of this group.




9% of people who tested 
positively have lizard flu 




Your new base rate


Just a regular person.


Let’s hurry up
and run Bayes’
rule again...


Sharpen your pencil 


Using the new test and your revised base rate, let’s calculate the 
probability that you have lizard flu given your results. 


Advanced


Lizard flu diagnostic test

Accuracy analysis 

If someone has lizard flu, the probability that 
the test returns positive for it is 99 percent. 

If someone doesn’t have lizard flu, the 
probability that the test returns positive for it 
is 1 percent. 


Remember, 9% of people
like you will have lizard flu.


The number of 
people who have it 


1,000 people 



The number of people 
who don’t have it 








The number who 
test positive 


The number who 
test negative 


The number who 
test positive 


The number who 
test negative 


The probability that 
you have it, given that 
you tested negative 


# of people who have it and test negative 

(# of people who have it and test negative) + 

(# of people who don’t have it and test negative) 
bayes saves the day


Sharpen your pencil

Solution


What do you calculate your new probability
of having lizard flu to be?


Advanced

Lizard flu diagnostic test

Accuracy analysis 

If someone has lizard flu, the probability that 
the test returns positive for it is 99 percent. 

If someone doesn’t have lizard flu, the 
probability that the test returns positive for it 
is 1 percent. 


1,000 people

    The number of people who have it: 90
        The number who test positive: 89
        The number who test negative: 1

    The number of people who don’t have it: 910
        The number who test positive: 9
        The number who test negative: 901

9% of people who’ve tested positively have lizard flu.
91% of people who’ve tested positively don’t have lizard flu.


The probability that you have it, given that you tested negative

    = (# of people who have it and test negative) /
      [(# of people who have it and test negative) + (# of people who don’t have it and test negative)]
    = 1 / (1 + 901)
    ≈ 0.001


There’s a 0.1% chance I have lizard flu!
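
The two-test story can also be run end to end in code. The sketch below is an illustration only: `update` is a made-up helper name, and it chains the unrounded result of the first test into the second, so the final figure differs very slightly from the pencil-and-paper version that starts from a rounded 9 percent.

```python
def update(prior, p_result_given_flu, p_result_given_no_flu):
    """One Bayes' rule update for whatever test result was observed (hypothetical helper)."""
    numerator = prior * p_result_given_flu
    return numerator / (numerator + (1 - prior) * p_result_given_no_flu)


# First test came back positive: P(+|L) = 0.90, P(+|~L) = 0.09.
after_first = update(0.01, 0.90, 0.09)

# Advanced test came back negative, so use the negative-result rates:
# P(-|L) = 1 - 0.99 and P(-|~L) = 1 - 0.01.
after_second = update(after_first, 1 - 0.99, 1 - 0.01)

print(round(after_first, 3))
print(round(after_second, 4))
```

Notice that the output of the first update is the base rate for the second one. That is all “using Bayes’ rule over and over” amounts to.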


What a relief! 


You took control of the probabilities 

using Bayes’ rule and now know how to 
manage base rates. 


The only way to avoid the base rate fallacy 
is always to be on the lookout for base rates 
and to be sure to incorporate them into your 
analyses. 



Your probability of
lizard flu is so low that you
can pretty much rule it out.


No lizard flu for you!


Now you’ve just got to
shake that cold...
7 subjective probabilities



Sometimes, it’s a good idea to make up numbers. 

Seriously. But only if those numbers describe your own mental states, expressing your 
beliefs. Subjective probability is a straightforward way of injecting some real rigor into 
your hunches, and you’re about to see how. Along the way, you are going to learn how 
to evaluate the spread of data using standard deviation and enjoy a special guest 
appearance from one of the more powerful analytic tools you’ve learned. 





opportunity in obscurity 


Backwater Investments needs your help

Backwater Investments is a business that tries to 
make money by seeking out obscure investments 
in developing markets. They pick investments that 
other people have a hard time understanding or 
even finding. 


Backwater owns companies here... here... and even here.


Their strategy means that they rely heavily on the 

expertise of their analysts, who need to have 
impeccable judgment and good connections to be 
able to get BI the information they need for good 
investment decisions. 


It’s a cool business, except it’s about to be torn 
apart by arguments among the analysts. The 
disagreements are so acrimonious that everyone’s 
about to quit, which would be a disaster for the 
fund. 


The internal crisis at Backwater 
Investments might force the company 
to shut down. 


Their analysts are at 
each other's throats 


The analysts at BI are having big disagreements 
over a number of geopolitical trends. And this is a 
big problem for the people trying to set investment 
strategy based on their analyses. There are a bunch 
of different issues that are causing splits. 


The analysts are in full revolt! 
If I don’t get them agreeing 
on something, they'll all quit. 



Where precisely are the disagreements? It would 
be really great if you could help figure out 
the scope of the dispute and help achieve 
a consensus among the analysts. Or, at the 
very least, it’d be nice if you could specify the 
disagreements in a way that will let the BI bosses 
figure out where they stand. 



Let’s take a look at 
the disputes... 
what’s the fuss?

Sharpen your pencil

Take a look at these emails, which the analysts have sent you. Do
they help you understand their arguments?


From: Senior Research Analyst, Backwater Investments 

To: Head First 
Subject: Rant on Vietnam 

For the past six months, I’ve consistently argued to
the staff that the Vietnamese government is probably
going to reduce its taxes this year. And everything that
we’ve seen from our people on the ground and in news
reports confirms this.

Yet others in the “analytical” community at BI seem
to think this is crazy. I’m considered a dreamer by the
higher-ups and told that such a gesture on the part of
the government is “highly unlikely.” Well, what do they
base this assessment on? Clearly the government is
encouraging foreign investment. I’ll tell you this: if taxes
go down, there will be a flood of private investment, and
we need to increase our presence in Vietnam before the


The analysts are kind of
bent out of shape.


...these three countries?


From: Political Analyst, Backwater Investments
To: Head First

Subject: Investing in obscure places: A Manifesto

Russia, Indonesia, Vietnam. The community at BI has
become obsessed with these three places. Yet aren’t
the answers to all our questions abundantly clear?
Russia will continue to subsidize oil next quarter like
it always has, and they’re more likely than not to buy
EuroAir next quarter. Vietnam might decrease taxes
this year, and they probably aren’t going to encourage
foreign investment. Indonesia will more likely than not
invest in ecotourism this year, but it won’t be of much
help. Tourism will definitely fall apart completely.

If BI doesn’t fire the dissenters and troublemakers who
dispute these truths, the firm might as well close...



From: VP, Economic Research, Backwater Investments 
To: Head First 

Subject: Have these people ever even been to Russia? 

While the analytic staff in the Economic division has
continued to flourish and produce quality work on
Russian business and government, the rest of BI has
shown a shocking ignorance of the internal dynamics
of Russia. It’s quite unlikely that Russia will purchase
EuroAir, and their support of the oil industry next
quarter may be the most tentative it’s ever been...


Even the VP is
starting to lose his cool!


This guy’s from
the field,
doing firsthand research.


From: Junior Researcher, Backwater Investments 

To: Head First 
Subject: Indonesia 

You need to stop listening to the eggheads back at 
corporate headquarters. 

The perspective from the ground is that tourism
definitely has a good chance of increasing this year,
and Indonesia is all about ecotourism. The eggheads
don’t know anything, and I’m starting to think that my
intel would be better used by a competing firm…


What are the key issues causing the disagreement? 


The authors each use a bunch of words to describe what they think the 
likelihoods of various events are. List all the "probability words” they use. 
areas of disagreement


Sharpen your pencil
Solution

There are a bunch of probability
words used in these emails...


What are your impressions of the arguments, now that you’ve
read the analysts’ emails?




[The four analyst emails shown above appear here again.]


What are the key issues causing the disagreement? 


There seem to be six areas of disagreement:

1) Will Russia subsidize oil business next quarter?
2) Will Russia purchase EuroAir?
3) Will Vietnam decrease taxes this year?
4) Will Vietnam’s government encourage foreign investment this year?
5) Will Indonesian tourism increase this year?
6) Will the Indonesian government invest in ecotourism?


The authors use a bunch of words to describe what they think the likelihoods 
of various events are. List all the "probability words” they use. 

The words they use are: probably, highly unlikely, more likely, unlikely, might,
definitely, good chance.


Jim: So we’re supposed to come in and tell everyone 
who’s right and who’s wrong? That shouldn’t be a 
problem. All we need is to see the data. 


Frank: Not so fast. These analysts aren’t just regular 
folks. They’re highly trained, highly experienced, 
serious domain experts when it comes to these countries. 

Joe: Yeah. The CEO says they have all the data
they could ever hope for. They have access to the best 
information in the world. They pay for proprietary data, 
they have people digging through government sources, and 
they have people on the ground doing firsthand reporting. 

Frank: And geopolitics is highly uncertain stuff. 

They’re predicting single events that don’t have a big trail 
of numerical frequency data that you can just look at 
and use to make more predictions. They’re aggregating 
data from a bunch of sources and making very highly 
educated guesses. 

Jim: Then what you’re saying is that these guys are 
smarter than we are, and that there is really nothing we 
can do to fix these arguments. 


Joe: Providing our own data analysis would be just 
adding more screaming to the argument. 


Frank: Actually, all the arguments involve hypotheses 
about what’s going to happen in the various countries, 
and the analysts really get upset when it comes to those 
probability words. “Probably?” “Good chance?” What 
do those expressions even mean? 



Jim: So you want to help them find better words to 
describe their feelings? Gosh, that sounds like a waste 
of time. 


Frank: Maybe not words. We need to find something 
that will give these judgments more precision, even 
though they’re someone’s subjective beliefs … 


How would you
make the probability
words more precise?
meet subjective probabilities


Subjective probabilities 
describe expert beliefs 

When you assign a numerical probability to 
your degree of belief in something, you’re 
specifying a subjective probability. 

Subjective probabilities are a great way to 
apply discipline to an analysis, especially when 
you are predicting single events that lack hard 
data to describe what happened previously 
under identical conditions. 


Everyone talks like this... 



...but what do they really mean? 



These are subjective probabilities.


These figures are a lot more
precise than the words the analysts used
to describe their beliefs.


Subjective probabilities might show 
no real disagreement after all 




Sharpen your pencil


Sketch an outline of a spreadsheet that would contain all the 
subjective probabilities you need from your analysts. How would you 
structure it? 



Draw a picture of the
spreadsheet you want here.


What you want is a subjective
probability from each analyst for
each of the key areas of dispute.
visualize them in a grid

Sharpen your pencil

Solution 


What does the spreadsheet you want from the analysts to 
describe their subjective probabilities look like? 


The table will take each
of the six statements
and list them at the top.


1. Russia will subsidize oil business next quarter.
2. Russia will purchase a European airline next quarter.
3. Vietnam will decrease taxes this year.
4. Vietnam’s government will encourage foreign investment this year.
5. Indonesian tourism will increase this year.
6. Indonesian government will invest in ecotourism.

We can fill in the blanks of what each
analyst thinks about each statement here.


Analyst | Statement 1 | Statement 2 | Statement 3 | Statement 4 | Statement 5 | Statement 6


[Spreadsheet of analyst responses: one row per analyst, with a subjective probability in each of the columns Statement 1 through Statement 6.]


Now we’re getting somewhere. 

While you haven’t yet figured out how to 
resolve all their differences, you have definitely 
succeeded at showing where exactly the 
disagreements lie. 

And from the looks of some of the data, there
might not be all that much disagreement
after all, at least not on some issues.


Let’s see what the CEO has
to say about this data…


The analysts responded with 
their subjective probabilities 


That is really intriguing 
work. The arguments don’t 
seem so nasty from this 
angle... 




ceo puzzled by your work 


The CEO doesn’t see
what you’re up to

It appears that he doesn’t think these
results provide anything that can be
used to resolve the disagreements
among the analysts.


He doesn’t think these
figures are of any help.


From: CEO, Backwater Investments 
To: Head First 

Subject: Your “subjective probabilities” 

I’m kind of puzzled by this analysis. 
What we’ve asked you to do is resolve
the disagreements among our analysts,
and this just seems like a fancy way of
listing the disagreements.

We know what they are. That’s not 
why we brought you on board. What 
we need you to do is resolve them or 
at least deal with them in a way that 
will let us get a better idea of how to 
structure our investment portfolio in 
spite of them. 

You should defend your choice of 
subjective probabilities as a tool for 
analysis here. What does it get us? 


Is he right?


The pressure’s on!


You should probably explain 
and defend your reason for 
collecting this data to the 
CEO... 
[Grid of the analysts’ subjective probabilities]




Sharpen your pencil 


Is your grid of subjective probabilities... 


...any more useful analytically than these angry emails?


[The four analyst emails shown earlier]


Why or why not?
client persuaded


Sharpen your pencil

Solution 


Is your grid of subjective probabilities... 


Any more useful analytically than these angry emails? 


[The four analyst emails shown earlier, alongside the grid of subjective probabilities]


The subjective probabilities show that some areas are not as contentious as we previously thought.
The subjective probabilities are a precise specification of where there is disagreement and how much of
it there is. The analysts can use them to help them figure out what they should focus on to
solve their problems.


You’ve bought some time and
can continue your work.


From: CEO, Backwater Investments 
To: Head First 

Subject: Visualization request 

OK, you’ve persuaded me. But I don’t
want to read a big grid of numbers. 
Send me a chart that displays this 
data in a way that is easier for me to 
understand. 


Let’s make this visual


[Grid of the analysts’ subjective probabilities, Statement 1 through Statement 6]



For each value, plot a 
dot corresponding to the 
subjective probability. 


The vertical axis doesn’t really
matter; you can just jitter dots
around so you can see them all.
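
If you want to try a dot plot without reaching for a charting package, here is a rough text-only sketch in Python. Everything in it is an assumption made for illustration: the function name `dot_plot`, the `width` parameter, and the sample values, which are invented and are not the analysts’ real responses. Instead of jittering, dots that land in the same column are stacked so you can still see them all.

```python
def dot_plot(values, width=51):
    """Draw a text dot plot of probabilities between 0.0 and 1.0 (hypothetical helper)."""
    counts = [0] * width
    for value in values:
        # Map each probability onto a column of the horizontal axis.
        counts[round(value * (width - 1))] += 1
    rows = []
    # Print the tallest stacks first so the plot reads bottom-up.
    for level in range(max(counts), 0, -1):
        rows.append("".join("o" if count >= level else " " for count in counts))
    rows.append("0.0" + "-" * (width - 6) + "1.0")
    return "\n".join(rows)


# Made-up probabilities, for illustration only.
example = dot_plot([0.0, 0.5, 0.5, 1.0], width=11)
print(example)
```

A tight cluster of dots suggests consensus; dots scattered along the whole axis suggest a real dispute.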



Statement 1 

Russia will subsidize oil business next quarter. 



Russia will purchase a European airline next quarter. 



Wcrcs av\ 
example. 


0.0 
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0.8 
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Statement 3 

Vietnam will decrease taxes this year. 




0.0 


0.2 


0.4 


0.6 


0.8 


1.0 


Statement 4 

Vietnam’s government will encourage foreign investment this year. 





Statement 5 

Indonesian tourism will increase this year. 


Statement 6 

Indonesian government will invest in ecotourism. 
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— ^iharpen yuur pencil 
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visualize the spread 

Sharpen your pencil
Solution

How do the spreads of analyst 
subjective probabilities look on 
your dot plots? 
It looks like there is actually some consensus on this statement.

Statement 1 

Russia will subsidize oil business next quarter. 




Statement 2 

Russia will purchase a European airline next quarter. 
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Statement 3 

Vietnam will decrease taxes this year. 




Statement 4 

Vietnam’s government will encourage foreign investment this year. 
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Statement 5 | 

Indonesian tourism will increase this year.
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Statement 6 

Indonesian government will invest in ecotourism. 
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The CEO loves your work 


subjective probabilities 


From: CEO, Backwater Investments 
To: Head First 
Subject: Thank you! 

Now this is actually a big help. I can 
see that there are a few areas where 
we really should concentrate our 
resources to get better information. 

And the stuff that doesn’t appear to
have real disagreement is just fantastic. 

From now on, I don’t want to hear 
anything from my analysts unless it’s in 
the form of a subjective probability (or 
objective probability, if they can come 
up with one of those). 

Can you rank these questions for me 
by their level of disagreement? I want 
to know which ones specifically are the 
most contentious.
- CEO


Subjective probabilities are something 
that everyone understands but that don’t get 
nearly enough use. 


Great data analysts are great communicators, 
and subjective probabilities are an illuminating 
way to convey to others exactly what you think 
and believe. 





What metric would measure disagreement 
and rank the questions so that the CEO 
can see the most problematic ones first? 
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meet standard deviation 


The standard deviation measures how 
far points are from the average 


You want to use the standard deviation. 

The standard deviation measures how far 
typical points are from the average (or mean) 
of the data set. 


Most of the points in a data set will be within 
one standard deviation of the mean. 


A sample data set

Most observations in any data set are going to be within one standard deviation of the mean.


Average = 0.5 



[Dot plot of the sample data set, axis 0.0 to 1.0]



One standard deviation = 0.1 


The unit of the standard deviation is 
whatever it is that you’re measuring. In the 
case above, one standard deviation from the 
mean is equal to 0.1 or 10 percent. Most 
points will be 10 percent above or below the 
mean, although a handful of points will be 
two or three standard deviations away. 

Standard deviation can be used here to 
measure disagreement. The larger the 
standard deviation of subjective probabilities 
from the mean, the more disagreement there 
will be among analysts as to the likelihood 
that each hypothesis is true. 


Use the STDEV formula to calculate the standard deviation.

=STDEV(data range)
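
As a rough illustration of the same idea outside a spreadsheet, here is a minimal Python sketch that computes the standard deviation for each statement and sorts the statements so the most contentious comes first. The probabilities below are invented for illustration and are not the analysts' real numbers. `statistics.stdev` computes the sample standard deviation, which is the same convention the spreadsheet STDEV formula uses.

```python
from statistics import stdev

# Hypothetical subjective probabilities from five analysts per statement
# (made-up values, not the data in the book's spreadsheet).
beliefs = {
    "Statement 1": [0.90, 0.85, 0.95, 0.90, 0.88],
    "Statement 2": [0.10, 0.70, 0.40, 0.95, 0.25],
    "Statement 3": [0.50, 0.55, 0.45, 0.60, 0.40],
}


def rank_by_disagreement(data):
    """Return (statement, standard deviation) pairs, most contentious first."""
    spreads = {name: stdev(values) for name, values in data.items()}
    return sorted(spreads.items(), key=lambda item: item[1], reverse=True)


ranking = rank_by_disagreement(beliefs)
for name, sd in ranking:
    print(f"{name}: {sd:.3f}")
```

A larger standard deviation means the analysts' dots are spread further from their average, so the statement at the top of the list is the one with the most disagreement.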









subjective probabilities 


Exercise


For each statement, calculate the standard deviation. Then, sort the list of questions 
to rank highest the question with the most disagreement. 


What formula would you use to calculate the standard deviation for the first statement? 



This data has been turned on its side so that you can sort the statements once you have the standard deviation.

Load this!

www.headfirstlabs.com/books/hfda/hfda_ch07_data_transposed.xls
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measuring contention 
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What standard deviations did you find? 


Exercise Solution


What formula would you use to calculate the standard deviation for the first statement? 

STVBVCdZMV 


1 2 3 s & yr 
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subjective probabilities 


there are no Dumb Questions

Q: Aren’t subjective probabilities kind of deceptive?

A: Deceptive? They’re a lot less deceptive than vague expressions
like "really likely." With probability words, the person listening to
you can pour all sorts of possible meanings into your words, so
specifying the probabilities is actually a much less deceptive way to
communicate your beliefs.

Q: I mean, isn't it possible or even likely (pardon the
expression) that someone looking at these probabilities would
get the impression that people are more certain about their
beliefs than they actually are?

A: You mean that, since the numbers are in black and white, they
might look more certain than they actually are?

Q: That’s it.

A: It’s a good concern. But the deal with subjective probabilities is
the same as any other tool of data analysis: it’s easy to bamboozle
people with them if what you’re trying to do is deceive. But as long
as you make sure that your client knows that your probabilities are
subjective, you’re actually doing him a big favor by specifying your
beliefs so precisely.

Q: Hey, can Excel do those fancy graphs with the little dots?

A: Yes, but it’s a lot of trouble. These graphs were made in a
handy little free program called R using the dotchart function.
You'll get a taste of the power of R in later chapters.



Good work. I’m going to base my trading
strategy on this sort of analysis from
now on. If it pans out, you'll definitely
see a piece of the upside.

The big boss
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problems from russia 


Russia announces that it will 
sell all its oil fields, citing loss 
of confidence in business 

In a shocking move, Russian president 
poo-poohs national industry 

“Da, we are finished with oil,” said the Russian 
president to an astonished press corps earlier 
today in Moscow. “We have simply lost 
confidence in the industry and are no longer 
interested in pursuing the resource...” 
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subjective probabilities 


You were totally blindsided by this news 

The initial reaction of the analysts to this news is 
great concern. Backwater Investments is heavily 
invested in Russian oil, largely because of what you 
found to be a large consensus on oil’s prospects for 
continued support from the government. 

Statement 1 

Russia will subsidize oil business next quarter. 



[Dot plot: analysts' subjective probabilities for Statement 1, axis 0.0 to 1.0]

But this news could cause the value of these 
investments to plummet, because people will 
suddenly expect there to be some huge problem 
with Russian oil. Then again, this statement could 
be a stratagem by the Russians, and they might not
actually intend to sell their oil fields at all. 


Does this mean that your analysis was wrong? 


Sharpen your pencil 


What should you do with this new information? 
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the tools are working 


Sharpen your pencil

Solution 


Were you totally off base? 

The analysis definitely wasn't wrong. It accurately reflected beliefs that the analysts made with limited data. The problem is simply that the analysts were wrong. There is no reason to believe that using subjective probabilities guarantees that those probabilities will be right.

What now? 

We need to go back and revise all the subjective probabilities. Now that we have more and better
information, our subjective probabilities are likely to be more accurate.



We've picked up a lot of analytic tools so far.
Maybe one of them could be useful at figuring 
out how to revise the subjective probabilities. 






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subjective probabilities 


Sharpen your pencil 


Better pick an analytic tool you can use to incorporate this new 
information into your subjective probability framework. Why would you 
or wouldn’t you use each of these? 


Experimental design? 


Optimization? 


A nice graphic? 


Hypothesis testing? 


Bayes' rule? 
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the return of bayes 

Sharpen your pencil



Better pick an analytic tool you can use to incorporate this new 
information into your subjective probability framework. Why would you 
or wouldn’t you use each of these? 


Experimental design? 

It's kind of hard to imagine what sort of experiment you could run to get better data. Since all the analysts are evaluating geopolitical events, it seems that every single piece of data they are looking at is observational.

Optimization? 

There is no hard numerical data! The optimization tools we've learned presuppose that you have
numerical data and a numerical result you want to maximize or minimize. Nothing for optimization here.


A nice graphic? 

There's almost always room for a nice data visualization. Once we've revised the subjective probabilities, we'll certainly want a new visualization, but for now, we need a tool that gives us better numbers.


Hypothesis testing? 

There is definitely a role for hypothesis testing in problems like this one, and the analysts might use it to derive their beliefs about Russia's behavior. But our job is to figure out specifically how the new data changes people's subjective probabilities, and it's not clear how hypothesis testing would do that.

Bayes' rule? 

Now this sounds promising. Using each analyst's first subjective probability as a base rate, maybe we
can use Bayes' rule to process this new information.


subjective probabilities 


Bayes' rule is great for revising
subjective probabilities 

Bayes’ rule is not just for lizard flu! It’s great for 
subjective probabilities as well, because it allows you 
to incorporate new evidence into your beliefs about your 
hypothesis. Try out this more generic version of Bayes'
rule, which uses H to refer to your hypothesis (or 
base rate) and E to refer to your new evidence. 



Here's the formula you used to figure out your chances of lizard flu:

$$P(L \mid +) = \frac{P(L)\,P(+ \mid L)}{P(L)\,P(+ \mid L) + P(\sim L)\,P(+ \mid \sim L)}$$

And here's the generic version:

$$P(H \mid E) = \frac{P(H)\,P(E \mid H)}{P(H)\,P(E \mid H) + P(\sim H)\,P(E \mid \sim H)}$$

P(H|E): The probability of the hypothesis, given the evidence. This is what you want.
P(H): The probability of the hypothesis.
P(E|H): The probability that you'd see the evidence, given that the hypothesis is true.
P(~H): The probability that the hypothesis is false.
P(E|~H): The probability that you'd see the evidence, given that the hypothesis is false.


Using Bayes’ rule with subjective probabilities is all 
about asking for the probability that you’d see 
the evidence, given that the hypothesis is true. 

After you’ve disciplined yourself to assign a subjective
value to this statistic, Bayes' rule can figure out the rest.
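
To see how the pieces fit together, here is a worked example with purely hypothetical numbers (they are not taken from any analyst). Suppose an analyst's prior is P(H) = 0.9, so P(~H) = 0.1, and the analyst judges P(E|H) = 0.6 and P(E|~H) = 0.7:

```latex
P(H \mid E)
  = \frac{P(H)\,P(E \mid H)}{P(H)\,P(E \mid H) + P(\sim H)\,P(E \mid \sim H)}
  = \frac{0.9 \times 0.6}{0.9 \times 0.6 + 0.1 \times 0.7}
  = \frac{0.54}{0.61}
  \approx 0.885
```

Under these assumed values, the evidence nudges the analyst's belief down only slightly, from 0.9 to about 0.885, because the evidence is almost as likely when the hypothesis is true as when it is false.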


You already have these pieces of data:

The subjective probability that Russia will (and won’t) continue to subsidize oil: P(H) and P(~H). You know this.

You just need to get the analysts to give you these values:

The subjective probability that the news report would (or wouldn’t) happen, given
that Russia will continue to subsidize oil: P(E|H) and P(E|~H). What are these?

Why go to this trouble? Why not just
go back to the analysts and ask for
new subjective probabilities based on
their reaction to the events?

You could do that.
Let’s see what
that would mean...
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why use bayes here? 


Fireside Chats

Tonight’s talk: Bayes’ Rule and Gut Instinct smackdown

Gut Instinct: I don’t see why the analyst wouldn’t just ask me
for another subjective probability. I delivered like a
champ the first time around.

Bayes’ Rule: You did indeed, and I can’t wait to use your first
subjective probability as a base rate.

Gut Instinct: Well, thanks for the vote of confidence. But I still
don’t appreciate being kicked to the curb once I’ve
given the analyst my first idea.

Bayes’ Rule: Oh no! You’re still really important, and we need
you to provide more subjective probabilities to
describe the chances that we’d see the evidence
given that the hypothesis is either true or untrue.

Gut Instinct: I still don’t get why I can’t just give you a new
subjective probability to describe the chances that
Russia will continue to support the oil industry.

Bayes’ Rule: Using me to process these probabilities is a rigorous,
formal way to incorporate new data into the
analyst’s framework of beliefs. Plus, it ensures that
analysts won’t overcompensate their subjective
probabilities if they think they had been wrong.

Gut Instinct: Would anyone ever actually think like this? Sure,
I can see why someone would use you when he
wanted to calculate the chances he had a disease.
But just to deal with subjective beliefs?

Bayes’ Rule: OK, it’s true that analysts certainly don’t have to use
me every single time they learn anything new. But if
the stakes are high, they really need me. If you think
you might have a disease, or you need to invest a ton
of money, you want to use the analytical tools.

Gut Instinct: I guess I need to learn to tell the analyst to use you
under the right conditions. I just wish you made a
little more intuitive sense.

Bayes’ Rule: If you want, we can draw 1,000 little pictures of
Russia like we did in the last chapter…

Gut Instinct: Not that! Man, that was boring...
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subjective probabilities 



Here's Bayes' rule again:

$$P(H \mid E) = \frac{P(H)\,P(E \mid H)}{P(H)\,P(E \mid H) + P(\sim H)\,P(E \mid \sim H)}$$

Put your formula in the spreadsheet and copy/paste it for each analyst.


Exercise


Here’s a spreadsheet that lists two new sets of subjective probabilities that have been collected 
from the analysts. 


1) P(E|S1), which is each analyst’s subjective probability of Russia announcing that they’d
sell their oil fields (E), given the hypothesis that Russia will continue to support oil (S1)

2) P(E|~S1), which is each analyst’s subjective probability of the announcement (E) given that
Russia won't continue to support oil (~S1)

Write a formula that implements Bayes' rule to give you P(S1|E).

This is the probability that the hypothesis is true, given the new evidence.


www.headfirstlabs.com/books/hfda/hfda_ch07_new_probs.xls

the two new columns of
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revised subjective probabilities 





















Exercise Solution


What formula did you use to implement Bayes’ rule and derive new 
subjective probabilities for Russia’s support of the oil industry? 


This formula combines the base rate with their beliefs about the new data to come up with a new assessment.

=(B2*D2)/(B2*D2+C2*E2)
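
The same calculation can be sketched in Python, which makes it easy to apply the rule to every analyst at once. This is only an illustration: the function name `revise_belief` and the three rows of probabilities are made up, and the sketch assumes P(~S1) is simply 1 - P(S1) rather than reading it from its own column.

```python
def revise_belief(p_h, p_e_given_h, p_e_given_not_h):
    """Bayes' rule: return P(H|E) from a prior and the two likelihoods."""
    p_not_h = 1.0 - p_h  # assumes P(~H) = 1 - P(H)
    numerator = p_h * p_e_given_h
    denominator = numerator + p_not_h * p_e_given_not_h
    if denominator == 0:
        raise ValueError("evidence has zero probability under both hypotheses")
    return numerator / denominator


# Hypothetical analysts: (P(S1), P(E|S1), P(E|~S1)). Values are invented.
analysts = [
    (0.90, 0.60, 0.70),
    (0.85, 0.50, 0.50),
    (0.95, 0.10, 0.90),
]

# One revised probability P(S1|E) per analyst, like copying the formula down a column.
revised = [revise_belief(*row) for row in analysts]
for prior_row, posterior in zip(analysts, revised):
    print(f"prior {prior_row[0]:.2f} -> revised {posterior:.3f}")
```

In this made-up data, only the third analyst, who thinks the announcement is far more likely if Russia is abandoning oil, ends up with a much lower revised probability.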



[Figure: spreadsheet listing, for each analyst, P(S1), P(~S1), P(E|S1), P(E|~S1), and the revised P(S1|E)]

1 1* J1 1- 1 1 T* 1- 1* _rx 
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subjective probabilities 


Sharpen your pencil


Using the data on the facing page, plot the new subjective 
probabilities of each analyst on the chart below. 


Make this chart show P(S1|E), your revised probability.

[Blank dot plot axis: 0.0 to 1.0]


As a point of reference, here is the plot of people’s beliefs in the 
hypothesis that Russia would continue to support the oil industry 
as they were before the news report. 


This is P(S1), the prior subjective probabilities.

[Dot plot: prior subjective probabilities P(S1), axis 0.0 to 1.0]


How do you compare the new distribution of subjective 
probabilities with the old one? 
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only a small change 



How does the new distribution of beliefs about Russia’s support
for the oil industry look? 




[Dot plot: revised subjective probabilities P(S1|E), axis 0.0 to 1.0]


How do you compare the two? 

The spread of the new set of subjective probabilities is a little wider, but only three people assign
to the hypothesis subjective probabilities that are significantly lower than what they had thought
previously. For most people, it still seems highly likely that Russia will continue to support oil,
even though Russia claims to be selling their oil fields.
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subjective probabilities 


The CEO knows exactly what to
do with this new information


Everyone is selling their Russia holdings,
but this new data on my analysts' beliefs
leads me to want to hold on to ours.

Let's hope this works!

Backwater Investments CEO

The news about selling the oil fields.



On close inspection, the analysts concluded 
that the Russian news is likely to report the 
selling of their oil fields whether it’s true that 
they will stop supporting oil or not. 

So the report didn’t end up changing their 
analyses much, and with three exceptions, 
their new subjective probabilities [P(S1 | E)] 
that Russia would support their oil industry 
were pretty similar to their prior subjective 
probabilities [P(S1)] about the same 
hypothesis. 
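
A short derivation shows why evidence that is about equally likely either way barely moves a belief. As a simplifying assumption, suppose an analyst thinks the report is exactly as likely whether or not Russia keeps supporting oil, so that P(E|S1) = P(E|~S1) = q with q > 0:

```latex
P(S1 \mid E)
  = \frac{P(S1)\,q}{P(S1)\,q + P(\sim S1)\,q}
  = \frac{P(S1)}{P(S1) + P(\sim S1)}
  = P(S1)
```

The q cancels, and since P(S1) + P(~S1) = 1, the revised probability equals the prior. The closer the two likelihoods are to each other, the less the evidence changes the analyst's belief.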


But are the analysts right?
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cha ching! 


The news about
selling the oil fields.

Your second
analysis.


Russian stock owners rejoice! 

The analysts were right: Russia was bluffing 
about selling off their oil fields. And the market 
rally that took place once everyone realized it 
was very good for Backwater. 

Looks like your subjective probabilities kept 
heads cool at Backwater Investments and 
resulted in a big payoff for everyone! 


[Figure: market value over time, marking your first analysis of subjective probabilities]











8 heuristics

Analyze like a human



The real world has more variables than you can handle. 

There is always going to be data that you can’t have. And even when you do have data 
on most of the things you want to understand, optimizing methods are often elusive and 
time consuming. Fortunately, most of the actual thinking you do in life is not “rational
maximizing”; it’s processing incomplete and uncertain information with rules of thumb so
that you can make decisions quickly. What is really cool is that these rules can actually 
work and are important (and necessary) tools for data analysts. 


this is a new chapter 
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meet the littergitters 


LitterGitters submitted their
report to the city council 

The LitterGitters are a nonprofit group 

funded by the Dataville City Council 

to run public service announcements to 
encourage people to stop littering. 

They just presented the results of their 
most recent work to the city council, and 
the reaction is not what they’d hoped for. 






That last comment is the one we’re really 
worried about. It’s starting to look as if 
LitterGitters will be in big trouble very soon 
if you can’t persuade the city council that 
LitterGitters' public outreach programs have
been a success relative to the city council’s 
intentions for it. 


We're cutting your funding in
1 month, unless you can come
up with a way to show you've
reduced litter tonnage.
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heuristics 


The LitterGitters have really
cleaned up this town 


Before the LitterGitters came along, Dataville was a 
total mess. Some residents didn’t respect their home 
and polluted it with trash, damaging Dataville’s 
environment and looks, but all that changed when 
LitterGitters began their work. 


It’d be terrible for the city council to cut their 
funding. They need you to help them get better at 
communicating why their program is successful so 
that the city council will continue their support. 

I just know that our
program works... help!

The director

What LitterGitters does:


Public service 
announcements 


Publications 


Clean-up 

events 


Education 
in schools 



Sharpen your pencil

If the city council cuts funding, Dataville will turn back into a big trash dump!

Brainstorm the metrics you might use to fulfill the mandate.
Where exactly would litter tonnage reduction data come
from?
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measure effectiveness 


Sharpen your pencil

Solution 




How exactly would you collect the data that would show whether 
the LitterGitters’work had resulted in a reduction in litter 
tonnage? 


We could have litter separated from trash and weigh both repeatedly over time.
Or we could have special collections at places in Dataville that have a reputation for being filled
with litter. Has LitterGitters been making these sorts of measurements?


The LitterGitters have been measuring
their campaign's effectiveness 


LitterGitters have been measuring 
their results, but they haven’t measured 
the things you imagined in the 
previous exercise. They’ve been doing 
something else: surveying the general 
public. Here are some of their surveys. 



Their tactics, after all, are all about 
changing people’s behaviors so that 
they stop littering. Let’s take a look at a 
summary of their results … 
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heuristics 


Questions for the general public | Last year | This year
Do you litter in Dataville? | 10% | 5%
Have you heard of the LitterGitters program? | 5% | 90%
If you saw someone littering, would you tell them to throw their trash away in a trash can? | 2% | 25%
Do you think littering is a problem in Dataville? | 20% | 75%
Has LitterGitters helped you better to understand the importance of preventing litter? | 5% | 85%
Would you support continued city funding of LitterGitters' educational programs? | 20% | 81%


The mandate is to reduce
the tonnage of litter.


And educating people about why they need to change 
their behaviors will lead to a reduction in litter tonnage, 
right? That’s the basic premise of LitterGitters, and 
their survey results do seem to show an increase in public 
awareness. 

But the city council was unimpressed by this report, and 
you need to help LitterGitters figure out whether 
they have fulfilled the mandate and then persuade the 
city council that they have done so. 


Sharpen your pencil 




Do the LitterGitters' results show or suggest a reduction
in the tonnage of litter in Dataville? 
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elusvie tonnage 


Sharpen your pencil

Solution 


Does the data show or suggest a litter tonnage reduction 
because of LitterGitters'work? 


It might suggest a reduction, if you believe that people's reported change in beliefs has an impact on
litter. But the data itself only discusses public opinion, and there is nothing in it explicitly about
litter tonnage.


Tonnage is unfeasible to measure



Of course we don't measure tonnage. Actually
weighing litter is way too expensive and logistically 
complicated, and everyone in the field considers 
that Databurg 10% figure bogus. What else are we 
supposed to do besides survey people? 


This could be a problem. The city 
council is expecting to hear evidence 
from LitterGitters that demonstrates 
that the LitterGitters campaign has 
reduced litter tonnage, but all we 
provided them is this opinion survey. 

If it’s true that measuring tonnage 
directly is logistically unfeasible, then 
the demand for evidence of tonnage 
reduction is dooming LitterGitters to 
failure. 




The director
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heuristics 


Give people a hard question and they'll
answer an easier one instead 


LitterGitters knows that what they are 
expected to do is reduce litter tonnage, but 
they decided not to measure tonnage directly 
because doing so is such an expense. 


This is complex, 
expensive, and hard. 


You're going to need a big scale to weigh all this…



This is fast, easy, and 
clear. It’s just not what 
the city council wants. 


Reacting to difficult questions in this way is 
actually a very common and very human 
thing to do. We all face problems that are 
hard to tackle because they’re “expensive” 
economically — or cognitively (more on this 
in a moment) — and the natural reaction is to 
answer a different question. 

This simplified approach might seem like 
the totally wrong way to go about things, 
especially for a data analyst, but the irony is 
that in a lot of situations it really works. And, 
as you’re about to see, sometimes it’s the 
only option. 


Questions for the general public | Your answer
Do you litter in Dataville? | No
Have you heard of the LitterGitters program? | Yes
If you saw someone littering, would you tell them to throw their trash away in a trash can? | Yes
Do you think littering is a problem in Dataville? | Yes
Has LitterGitters helped you better to understand the importance of preventing litter? | Yes
Would you support continued city funding of LitterGitters' educational programs? |

Here are some of the opinion surveys they got back from people.
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litter systematics 


Littering in Dataville is
a complex system

Here’s one of LitterGitters’ internal research 
documents. It describes things you might want to 
measure in the world of litter. 


Litter in Dataville: Things you can measure

Where it’s found
    Parks
    Drainage ditches
    Roadside ditches
    Suburban neighborhoods
    Urban neighborhoods

Types of litter
    Cigarette butts
    Pet waste
    Plastic food wrappers
    Bottles and cans
    Leftover food
    Fishing line
    Weird, random stuff

People you can survey about litter
    General public (adults, children)
    Solid waste workers
    Public health officials
    Academic researchers

Costs you can measure
    Litter cleanup contractors (public, private)
    Litter reduction outreach programs
    Negative public health outcomes
    Non-residents' poor impression of the city
    Wild animals injured by litter

Places to measure tonnage
    Street sweeper processing station
    LitterGitters cleanup activities
    Public landfills
    Private landfills

All of these factors for neighboring cities


And here is the director’s 
explanation of this big system 
and the implications that its 
complexity has for the work of 
LitterGitters. 


From: Director, LitterGitters 
To: Head First 

Subject: Why we can’t measure tonnage 

In order to measure tonnage directly, we’d need staff at all the
contact points (processing stations, landfills, etc.) at all times. The
city workers won’t record the data for us, because they already
have plenty of work to do.

And staffing the contact points would cost us double what the city
already pays us. If we did nothing but measure litter tonnage, we
still wouldn’t have enough money to do it right.

Besides, the city council is all wrong when it focuses on tonnage.
Litter in Dataville is actually a complex system. There are lots of
people involved, lots of types of litter, and lots of places to find it.
To ignore the system and hyper-focus on one variable is a mistake.
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heuristics 


You can't build and implement a
unified litter-measuring model

Any sort of model you created to try to measure 
or design an optimal litter control program would 
have an awful lot of variables to consider. 

Not only would you have to come up with a 
general quantitative theory about how all 
these elements interact, but you’d also have to 
know how to manipulate some of those variables 
(your decision variables) in order to minimize
litter tonnage.


In order to optimize, you have to know the entire system.

An objective function shows how you want to maximize your objective
in an optimization problem.

Lots of variables to consider here.

The city council wants to get litter as low as possible, and we need to show
that the program does this.


This problem would be a beast even if you had 
all the data, but as you’ve learned getting all the 
data is too expensive. 


Is giving the city council what
they want even possible?
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handle all these variables 


Jill: This situation is a mess. We have a city 
council asking for something we can’t give them. 

Frank: Yeah. And even if we could provide 
the tonnage reduction figure, it would not be of 
much use. The system is too complex. 

Joe: Well, that figure would satisfy the city
council.

Jill: Yes, but we’re not here just to satisfy the
council. We’re here to reduce litter!

Joe: Couldn’t we just make something up? Like 
do our own “estimate” of the tonnage? 

Frank: That’s an option, but it’s pretty dicey. I 
mean, the city council seems like a really tough 
group. If we were to make up some subjective 
metric like that and have it masquerade as a 
tonnage metric, they might flip out on us. 

Jill: Making up something is a sure way to get 
LitterGitters' funding eliminated. Maybe we
could persuade the city council that opinion 
surveys really are a solid proxy for tonnage 
reduction? 

Frank: LitterGitters already tried that. Didn’t 
you see the city council screaming at them? 

Jill: We could come up with an assessment 
that incorporates more variables than just 
public opinion. Maybe we should try to collect 
together every variable we can access and just 
make subjective guesses for all the other variables? 

Frank: Well, maybe that would work... 
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heuristics 


Stop! We're making this
way too complicated. Why
can't we just pick one or two
more variables, analyze them 
too, and leave it at that? 


You can indeed go with just a 
few more variables. 

And if you were to assess the effectiveness 
of LitterGitters by picking one or two 
variables and using them to draw a 
conclusion about the whole system, you’d 
be employing a heuristic … 
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meet heuristics 


Heuristics are a middle ground between
going with your gut and optimization 


Do you make decisions impulsively, or with a few 
well-chosen pieces of key data, or do you make 
decisions by building a model that incorporates 
every scrap of relevant data and yields the perfect 
answer? 


Your answer is probably “All of the above,” and 
it’s important to realize that these are all different 
ways of thinking. 


Intuition is seeing 
one option. 



Intuition can be scary for analysis.

Maybe you can't incorporate all the data.


Heuristics are seeing 
a few options. 


Most of your thinking takes place here.

Analysts avoid relying on intuition when they can, but decisions you make really quickly
often have to be intuitive.


If you’ve solved an optimization problem, you’ve 
found the answer or answers that represent 
the maximum or minimum of your objective 
function. 

And for data analysts, optimization is a sort of 
ideal. It would be elegant and beautiful if all 
your analytic problems could be definitively 
solved. But most of your thinking will be 
heuristic. 


Which will you
use for your data
analysis problems?
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heuristics 


Scholar's Corner

Heuristic 1. (Psychological definition.) Substituting a
difficult or confusing attribute for a more accessible
one. 2. (Computer science definition.) A way of solving
a problem that will tend to give you accurate
answers but that does not guarantee optimality.


Optimization is seeing 
all the options. 



Some psychologists even argue that all human 
reasoning is heuristic and that optimization is 
an ideal that works only when your problems are 
ultra-specified. 

But if anyone’s going to have to deal with an ultra- 
specified problem, it’ll be a data analyst, so don’t 
throw away your Solver just yet. Just remember 
that well-constructed heuristic decision-making 
protocols need to be part of your analytic toolkit. 


Optimization is an
ideal for analysts.
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heuristics as data analysis 


There are no
Dumb Questions


It seems weird that you’d have a 
decision procedure that didn’t guarantee 
a correct answer and call it “data
analysis.” Shouldn’t you call that sort of
thing “guesswork”?

Now that wouldn't be very nice! Look, 
data analysis is all about breaking down 
problems into manageable pieces and 
fitting mental and statistical models to data 
to make better judgements. There’s no 
guarantee that you'll always get the right 
answers. 

Can’t I just say that I’m always 
trying to find optimal results? If I’ve got 
to dabble in heuristic thinking a little, fine, 
but my goal is optimality? 

That’s fair to say. You certainly don’t 
want to use heuristic analytical tools when 
better optimizing tools are available and 
feasible. But what is important to recognize 
is that heuristics are a fundamental part of 
how you think and of the methods of data 
analysis. 

So what’s the difference between 
the psychological and the computer 
science definition of “heuristics”? 

They’re actually really similar. In 
computer science, heuristic algorithms have 
an ability to solve problems without people 
being able to prove that the algorithm will 
always get the right answer. Many times, 
heuristic algorithms in computer science 
can solve problems more quickly and more 
simply than an algorithm that guarantees the 
right answer, and often, the only algorithms 
available for a problem are heuristic. 
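
To make that trade-off concrete, here is a minimal sketch in R that compares a quick heuristic with an exhaustive search on a tiny packing problem. The items, their values and weights, and both function names are made up for illustration; nothing here comes from the LitterGitters data.

```r
# Hypothetical items: the values and weights are invented for this sketch.
items <- data.frame(
  value  = c(60, 100, 120),
  weight = c(10, 20, 30)
)
capacity <- 50

# Heuristic: grab items in order of value per unit of weight while they fit.
# Fast and simple, but nothing guarantees the best possible total.
greedy_value <- function(items, capacity) {
  ord <- order(items$value / items$weight, decreasing = TRUE)
  total <- 0
  for (i in ord) {
    if (items$weight[i] <= capacity) {
      capacity <- capacity - items$weight[i]
      total <- total + items$value[i]
    }
  }
  total
}

# Exhaustive search: try every subset, so the best total is guaranteed,
# but the number of subsets doubles with every item you add.
best_value <- function(items, capacity) {
  n <- nrow(items)
  best <- 0
  for (mask in 0:(2^n - 1)) {
    chosen <- bitwAnd(mask, 2^(0:(n - 1))) > 0
    if (sum(items$weight[chosen]) <= capacity) {
      best <- max(best, sum(items$value[chosen]))
    }
  }
  best
}

greedy_value(items, capacity)  # 160
best_value(items, capacity)    # 220
```

The heuristic lands on a decent answer in one pass, while the exhaustive version finds the true maximum at the cost of checking every combination.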

What does that have to do with 
psychology? 


Psychologists have found in 
experimental research that people use 
cognitive heuristics all the time. There is just 
too much data competing for our attention, 
so we have to use rules of thumb in order 
to make decisions. There are a number of 
classic ones that are part of the hard-wiring 
of our brain, and on the whole, they work 
really well. 

Isn’t it kind of obvious that human 
thinking isn’t like optimization? 

It depends on who you talk to. People 
who have a strong sense of humans as 
rational creatures might be upset by the 
notion that we use quick and dirty rules 
of thumb rather than think through all our 
sensory input in a more thorough way. 

So the fact that a lot of reasoning is 
heuristic means that I’m irrational? 

It depends on what you take to be 
the definition of the word “rational.” If 
rationality is an ability to process every bit 
of a huge amount of information at lightning 
speed, to construct perfect models to make 
sense of that information, and then to have 
a flawless ability to implement whatever 
recommendations your models suggest, then 
yes, you're irrational. 

That is a pretty strong definition of
“rationality.”

Not if you’re a computer. 

That’s why we let computers do 
data analysis for us! 

Computer programs like Solver live in 
a cognitive world where you determine the 


inputs. And your choice of inputs is subject 
to all the limitations of your own mind and 
your access to data. But within the world 
of those inputs, Solver acts with perfect 
rationality. 

And since “All models are
wrong, but some are useful,” even the
optimization problems the computer 
runs look kind of heuristic in the broader 
context. The data you choose as inputs 
might never cover every variable that has 
a relationship to your model; you just 
have to pick the important ones. 

Think of it this way: with data analysis, 
it’s all about the tools. A good data analyst 
knows how to use his tools to manipulate the 
data in the context of solving real problems. 
There’s no reason to get all fatalistic about 
how you aren’t perfectly rational. Learn the 
tools, use them wisely, and you'll be able to 
do a lot of great work. 

But there is no way of doing data 
analysis that guarantees correct answers 
on all your problems. 

No, there isn't, and if you make the 
mistake of thinking otherwise, you set 
yourself up for failure. Analyzing where 
and how you expect reality to deviate from 
your analytical models is a big part of data 
analysis, and we’ll talk about the fine art of
managing error in a few chapters. 

So heuristics are hard-wired into 
my brain, but I can make up my own, too? 

You bet, and what’s really important 
as a data analyst is that you know it when 
you’re doing it. So let’s give it a try... 
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heuristics 


Use a fast and frugal tree 

Here’s a heuristic that describes different 
ways of dealing with the problem of having 
trash you need to get rid of. It’s a really 
simple rule: if there’s a trash can, throw it 
in the trash can. Otherwise, wait until you 
see a trash can. 

This schematic way of describing a 
heuristic is called a fast and frugal tree. 

It’s fast because it doesn’t take long to 
complete, and it’s frugal because it doesn’t 
require a lot of cognitive resources. 


I’m done with my food wrapper. 



What the city council needs is its 
own heuristic to evaluate the quality 
of the work that LitterGitters has 
been doing. Their own heuristic is 
unfeasible (we’ll have to persuade 
them of that), and they reject 
LitterGitters’ current heuristic.

Can you draw a fast and frugal tree
to represent a better heuristic? Let’s 
talk to LitterGitters to see what they 
think about a more robust decision 
procedure. 



LitterGitters’ heuristic


Do people have better attitudes
after LitterGitters?
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simplifying assessment 


Is there a simpler way to assess 
LitterGitters’ success?


Using a heuristic approach to measure 
LitterGitters’ work would mean picking one or 
more of these variables and adding them to your 
analysis. What does the LitterGitters director think 
would be the best approach? 



Which of these variables can you add
to your analysis to give a fuller
picture of effectiveness?



[Diagram: Litter in Dataville, showing the variables you could add to the analysis]

Where it's found:
Parks
Drainage ditches
Roadside ditches
Suburban neighborhoods
Urban neighborhoods

People you can survey about litter:
General public (adults, children)
Solid waste workers
Public health officials
Academic researchers

Things you can measure (types of litter):
Cigarette butts
Pet waste
Plastic food wrappers
Bottles and cans
Leftover food
Fishing line
Weird, random stuff

Places to measure:
Street sweeper processing station
LitterGitters cleanup activities
Public landfills
Private landfills

Costs you can measure:
Litter cleanup contractors (public and private)
Litter reduction outreach programs
Negative public health outcomes
Non-residents' poor impression of the city
Wild animals injured by litter

All of these factors for neighboring cities


You just can’t leave out public opinion surveys. 

And, like I’ve said, there is just no way to weigh all the
litter in order to make a good comparison. But maybe you
could just poll the solid waste workers. The biggest problem
is cigarette butts, and if we periodically poll the street
sweepers and landfill workers about how many butts they’re
seeing, we’d have a not totally complete but still pretty solid
grip on what is happening with litter. 
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heuristics 


Sharpen your pencil


Draw a fast and frugal tree to describe how the city council 
should evaluate the success of LitterGitters. Be sure to include 
two variables that LitterGitters considers important. 

The final judgment should be whether to maintain or eliminate 
the funding of LitterGitters. 
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your first heuristic 

Sharpen your pencil


What heuristic did you create to evaluate whether LitterGitters
has been successful?


While your own tree might be a bit different,
here’s where you might have ended up.


First, the city council needs to
ask whether the public is reacting
positively to LitterGitters.

Is LitterGitters increasing people’s
enthusiasm about litter?


Has the public increased
its litter awareness?

If not, LitterGitters’ funding
should be eliminated: Kill funding.


If the public is supportive, do
the solid waste workers think
there has been a reduction?


Do solid waste workers believe
there has been a reduction?

This attribute is what
we’re substituting for
directly measuring tonnage.

If the solid waste workers don’t
think there’s been any effect,
that’s the end: Kill funding.

Here’s the outcome LitterGitters wants: Maintain funding.
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heuristics 


I am looking forward to seeing that report 
I hear you redid. But I’m expecting you to
be like all the other nonprofits that get 
Dataville money... a bunch of incompetents. 


It sounds as if at least one of the city 
council members has already made up 
his mind. What a jerk. This guy totally 
has the wrong way of looking at the work 
of LitterGitters. 



City council member


Sharpen your pencil 


This city council member is using a heuristic. Draw a diagram 
to describe his thought process in forming his expectation 
about LitterGitters. You need to understand his reasoning if you 
are going to be able to persuade this guy that your heuristic 
assessment ideas are valid. 
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dreaded stereotypes 


Sharpen your pencil
Solution


How does it seem this unpleasant city council member is forming
his expectations?


How do I judge
LitterGitters?


From my experience,
what are other
nonprofits like?


It seems as if he isn’t
interested in LitterGitters
itself... his other experiences
are driving his reaction.


Nonprofits are
incompetent.


LitterGitters are
incompetent.


Stereotypes are heuristics


Stereotypes are definitely heuristics: they don’t
require a lot of energy to process, and they’re super-fast.
Heck, with a stereotype, you don’t even need to
collect data on the thing you’re making a judgement 
about. As heuristics, stereotypes work. But in this 
case, and in a lot of cases, stereotypes lead to poorly 
reasoned conclusions. 


How do I judge 
LitterGitters? 


Not all heuristics work well in every case. A fast 
and frugal rule of thumb might help get answers 
for some problems while predisposing you to make 
inadequate judgements in other contexts. 


▼ 

Ask some probing 
questions. 


A much better way to
judge LitterGitters would
be something like this:



Are their answers 
impressive? 


Heuristics can be downright dangerous! 



LitterGitters are quite 
sharp. 



LitterGitters are 
incompetent. 
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heuristics 



Let’s see what those sanitation 
workers have to say... 
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ready for takeoff 


Your analysis is ready to present 


Between your heuristic and the data 
you have, including the just-received 
responses from the sanitation 
workers below, you’re ready to start 
explaining what you see to the city 
council. 


Here’s how you decided the city council
should assess the work of LitterGitters.


Has the public increased its litter awareness?
    No: Kill funding
    Yes: Do solid waste workers believe there has been a reduction?
        No: Kill funding
        Yes: Maintain funding
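
If it helps to see the tree as something you can run, here is a small sketch in R. The function name and its two yes/no arguments are hypothetical; they simply stand in for the two questions in the tree.

```r
# Fast and frugal tree for the funding decision. Each question is a
# yes/no cue, and a "no" at either step exits straight to "Kill funding".
# The function and argument names are hypothetical.
funding_decision <- function(public_more_aware, workers_see_reduction) {
  if (!public_more_aware) {
    return("Kill funding")
  }
  if (!workers_see_reduction) {
    return("Kill funding")
  }
  "Maintain funding"
}

funding_decision(TRUE, TRUE)    # "Maintain funding"
funding_decision(TRUE, FALSE)   # "Kill funding"
funding_decision(FALSE, TRUE)   # "Kill funding"
```

Notice that when the first answer is "no," the second question is never even looked at. That early exit is what keeps the tree fast and frugal.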



Here’s our original data
describing the attitudes of the
general public about litter.


Here’s some data describing
the sanitation workers’ impressions
of litter in Dataville since
LitterGitters began their program.


Questions for the general public                          Last year   This year

Do you litter in Dataville?                                  10%          5%
Have you heard of the LitterGitters program?                  5%         90%
If you saw someone littering, would you tell them to
throw their trash away in a trash can?                        2%         25%
Do you think littering is a problem in Dataville?            20%         75%
Has LitterGitters helped you better to understand the
importance of preventing litter?                              5%         85%
Would you support continued city funding of
LitterGitters' educational programs?                         20%         81%
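
One quick way to read the public survey is to compute how far each answer moved. The sketch below, in R, copies the percentages from the table above; the short question labels are abbreviations invented for this example.

```r
# Percentage of "yes" answers from the general-public survey above.
# The short question labels are abbreviations chosen for this sketch.
survey <- data.frame(
  question  = c("litter", "heard_of_program", "would_speak_up",
                "litter_is_problem", "program_helped", "support_funding"),
  last_year = c(10, 5, 2, 20, 5, 20),
  this_year = c(5, 90, 25, 75, 85, 81)
)

# Change in percentage points from last year to this year.
survey$change <- survey$this_year - survey$last_year

# Show the biggest movers first.
survey[order(survey$change, decreasing = TRUE), ]
```

The only figure that goes down is the share of people who say they litter, which is the direction LitterGitters wants.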


Questions for the sanitation workers 

Have you noticed a reduction in litter coming into Dataville 
landfills since LitterGitters began their work? 

Are there fewer cigarette butts being collected off the streets 
since LitterGitters began their work? 

Have high-litter areas (downtown, city parks, etc.) seen a 
reduction in litter since LitterGitters began their work? 

Is littering still a significant problem in Dataville? 


This year

75%


Percentage of people who answered “yes.”

We can’t compare this figure to last
year, because we just started
collecting data for this report.
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heuristics 


Sharpen your pencil 


Why can’t you measure tonnage directly? 


Answer the following questions from the city council about 
your work with LitterGitters. 


Can you prove that the campaign had an effect? 


Can you guarantee that your tactics will continue to work? 


Why not spend money on cleanup rather than education? 


You guys are just as incompetent as the others. 
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friendly interrogation 


Sharpen your pencil
Solution


How did you respond to the city council?


Why can’t you measure tonnage directly?

We can measure tonnage directly. The problem with doing it, though, is that it’d be too expensive.
It’d cost twice the amount of money you actually pay LitterGitters to do their work. So the best
course of action is to use this heuristic to assess performance. It’s simple but in our belief accurate.

Can you prove that the campaign had an effect?

All the data is observational, so we can’t prove that the increase in awareness of the general public
about litter and the reduction that sanitation workers believe has taken place is the result of
LitterGitters. But we have good reasons to believe that our program was the cause of these results.

Can you guarantee that your tactics will continue to work?

There are never guarantees in life, but as long as we’re sustaining
the improved public awareness that came out of our outreach program,
it’s hard to imagine that people will suddenly resume littering more.

Hmmm. It’s like you
actually know what
you’re talking about.

Why not spend money on cleanup rather than education?

But in that case, your objective wouldn’t be to reduce litter, because you’d
be doing nothing to get people to stop littering. The objective would be to clean
it up as fast as you can, and that’s not what LitterGitters does.

You guys are just as incompetent as the others.

We can’t speak for other nonprofits, but we have a crystal clear idea of
what we’re doing and how to measure the results, so we’re definitely not
incompetent. When did you say you were up for reelection?
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heuristics 


Looks like your analysis impressed 
the city council members 


Memorandum 

Re: LitterGitters and litter in Dataville 

The city council is pleased to renew the 
contract of LitterGitters, thanks to the excellent 
work from the Head First data analyst. We 
recognize that our previous assessment of the 
work of LitterGitters did not adequately treat 
the whole issue of litter in Dataville, and we 
discounted the importance of public opinion 
and behavior. The new decision procedure you 
provided is excellently designed, and we hope 
the LitterGitters continue to live up to the high 
bar they have set for themselves. LitterGitters 
will receive increased funding from the Dataville 
City Council this year, which we expect will 
help … 


Thanks so much for your help. Now there 
is so much more we’ll be able to do to
get the word out about stopping litter in 
Dataville. You really saved LitterGitters! 


Dataville will stay clean 
because of your analysis. 

Thanks to your hard work and subtle 
insight into these analytical problems, you 
can pat yourself on the back for keeping 
Dataville neat and tidy.
9 histograms


The shape of numbers

How much can a bar graph tell you? 

There are about a zillion ways of showing data with pictures, but one of them is special.
Histograms, which are kind of similar to bar graphs, are a super-fast and easy way to 
summarize data. You’re about to use these powerful little charts to measure your data’s 
spread, variability, central tendency, and more. No matter how large your data set is, 
if you draw a histogram with it, you’ll be able to “see” what’s happening inside of it. And 
you’re about to do it with a new, free, crazy-powerful software tool. 


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quest for recognition 


Your annual review is coming up 

You’ve been doing some really excellent 
analytical work lately, and it’s high time you 
got what’s coming to you. 

The powers that be want to know what you 
think about your own performance. 


Starbuzz Analyst Self-review 

Thank you for filling out our self-review! This document is important for 
our files and will help determine your future at Starbuzz. 


Date - 

Analyst Name -- 

Circle the number that represents how well-developed you 
consider your abilities to be. A low score means you think you 
need some help, and a high score means you think your work is 

excellent. 


The overall quality of your analytical work. 

1 2 3 4 5


Your ability to interpret the meaning and importance of past events. 

1 2 3 4 5 

Your ability to make level-headed judgements about the future. 

1 2 3 4 5 


Oh boy, a self-evaluation.


Bet you’d score higher
now than you would
have in Chapter 1!


Quality of written and oral communication. 

1 2 3 4 5


Your ability to keep your client well-informed and making good choices. 

1 2 3 4 5 


Your work is totally solid.


You deserve a pat on the back. 

Not a literal pat on the back, though... 
something more. Some sort of real 
recognition. But what kind of recognition? 
And how do you go about actually getting it? 
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histograms 


Sharpen your pencil 


You’d better brainstorm about strategies to get recognized. Write 
down how you’d respond to each of these questions. 


Should you just say thanks to your boss and hope for the best? If your boss really believes you’ve 
been valuable, he’ll reward you, right? 


Should you give yourself super-positive feedback, and maybe even exaggerate your talents a 
little? Then demand a big raise? 


Can you envision a data-based way of deciding on how to deal with this situation? 
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is more money in your future? 


We so deserve a raise. 
But how do we get the 
boss to give it to us? 


However you answered the questions 
on the last page, we think you should go for 
more money. You’re not doing this hard work 
for your health, after all. 


Going for more cash could play
out in a bunch of different ways



People can be skittish about trying to get more money from 
their bosses. And who can blame them? There are lots of 
possible outcomes, and not all of them are good. 


You could...


Ask for a little

Ask for a lot

Do nothing


Right now, you have
no idea what your
boss will think or do.


Raise

Incredible raise!

You’ve been canned
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Could research help you predict the outcomes? 

Even though your case is unique to you, it still might make sense 
to get an idea of your boss’s baseline expectations. 















Here's some data on raises

Because you’re so plugged in to Starbuzz’s
data, you have access to some sweet numbers:
Human Resources’ records about raises for the
past three years. 





www.headfirstlabs.com/books/hfda/hfda_ch09_employees.csv



Your company’s raises


Each line of the database
represents someone’s raise
for the specified year.

The raise is
measured as a
percentage increase.


This column says whether the
person is male or female... you
never know, there might be a correlation
between gender and raise amount.


This column tells you whether the
person asked for more money or not.
TRUE means they asked for more
cash, FALSE means they didn’t.


You might be able to wring some powerful
insights out of this data. If you assume that 
your boss will act in a similar way to how 
previous bosses acted, this data could tell you 
what to expect. 

Problem is, with approximately 3,000 
employees, the data set is pretty big. 

You’re going to need to do something
to make the data useful.


This data could be of use
to you as you figure out
what types of raises are
reasonable to expect.





How would you deal with this data? 
Could you manage it to make it more 
useful? 
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data-based negotiation 


Jim: We should forget about the data and 
just go for as much as we can get. Nothing in 
there will tell us what they think we’re worth. 
There’s a range of numbers in the boss’s head, 
and we need to figure out how to get the upper 
end of that range. 

Joe: I agree that most of the data is useless 
to tell us what they think we are worth, and I 
don’t see how we find out. The data will tell 
us the average raise, and we can’t go wrong 
shooting for the average. 

Jim: The average? You’ve got to be joking. 
Why go for the middle? Aim higher! 

Frank: I think a more subtle analysis is in 
order. There’s some rich information here, and 
who knows what it’ll tell us? 

Joe: We need to stay risk-averse and follow the
herd. The middle is where we find safety. Just 
average the Raise column and ask for that 
amount. 

Jim: That’s a complete cop-out! 

Frank: Look, the data shows whether people 
negotiated, the year of the raise, and people’s 
genders. All this information can be useful to 
us if we just massage it into the right format. 

Jim: OK, smarty pants. Show me. 

Frank: Not a problem. First we have to figure 
out how to collapse all these numbers into 
figures that make more sense... 



Better summarize the data. There’s just too 
much of it to read and understand all at once, and 
until you’ve summarized the data you don’t really 
know what’s in it. 

Start by breaking the data down into its basic 
constituent pieces. Once you have those pieces, 
then you can look at averages or whatever other 
summary statistic you consider useful. 
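
Here is a rough sketch in R of what "break it into pieces, then summarize" can look like. The six rows are invented stand-ins for the real raise records, and the column names (received, negotiated, gender) are assumptions about how the file is laid out.

```r
# Hypothetical rows standing in for the raise data. The column names
# (received, negotiated, gender) are assumptions about the file layout.
raises <- data.frame(
  received   = c(4.5, 12.1, 3.2, 6.8, 15.0, 5.1),
  negotiated = c(FALSE, TRUE, FALSE, TRUE, TRUE, FALSE),
  gender     = c("F", "M", "M", "F", "F", "M")
)

# Average raise for each combination of negotiation status and gender.
avg_by_group <- tapply(raises$received,
                       list(negotiated = raises$negotiated,
                            gender = raises$gender),
                       mean)
avg_by_group
```

Each cell of the resulting table is one "piece" of the data paired with one summary statistic. Swapping mean for median or length gives you a different summary of the same pieces.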


Where will you
begin your summary
of this data?
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Sharpen your pencil 


As you know, much of analysis consists of taking information and
breaking it down into smaller, more manageable pieces.

Draw a picture to describe how you would break these data fields
down into smaller elements.


Draw here to represent how you’d split
the data into smaller pieces.


What statistics could you use to summarize these elements? 
Sketch some tables that incorporate your data fields with 
summary statistics. 




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chunks of data 







Sharpen your pencil 
Solution


What sort of pieces would you break your data into?


Here are some examples...
your answers might be
a little different.


You can break the data in
your columns into pieces...

Raises of 4-5%

Men


... or you can combine those pieces of
data with pieces from other columns.


Female negotiators

Raises > 10%

Men in 2009


Here are a few ways you might integrate
your data chunks with summary statistics.


This table shows the
average raise for male
and female negotiators.

You’ve got loads
of options here.

              Women            Men
Negotiators   Average raise    Average raise


This chart shows the number of
raises in each interval
of raise possibilities.

How many people had a raise
between 6 and 7 percent?

Let’s actually create
this last visualization...

0-1%
1-2%
...
4-5%
5-6%
6-7%
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histograms 




It sure is fun to imagine 
summarizing these pieces 
of the data, but here’s a
thought. How about we 
actually do it? 


Using the groupings of data 
you imagined, you’re ready to 
start summarizing. 


When you need to slice, dice, and 
summarize a complex data set, you want 
to use your best software tools to do the 
dirty work. So let’s dive in and make your 
software reveal just what’s going on with 
all these raises. 
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excel for histograms 


In OpenOffice and older versions of 

Excel, you can find Data Analysis under the 
Tools menu.




Test Drive


A visualization of the number of people who fall in each 
category of raises will enable you to see the whole data set at 
once. 

So let’s create that summary... or even better, let’s do it 

graphically. 

1. Open the Data Analysis
dialogue box.

With your data open in Excel,
click the Data Analysis
button under the Data tab.

If you don’t see the Data
Analysis button, see Appendix
iii for how to install it.


2. Select Histogram.

In the pop-up window, tell
Excel that you want to create a histogram.
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histograms 


3. Select your data.

Select all your raise data under
the received column.

When your raise data is
selected, there should be a
marching selection
box all the way from the top...
... to the bottom.

Be sure to check this box (Chart Output) so
Excel makes a chart.


4. Run it.

Press OK and
let ’er rip!

What happens when
you create the chart?
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histograms for counting 


Hmm... this x-axis
sure looks like a mess.


Here’s the output from Excel.

Histograms show frequencies 
of groups of numbers 

Histograms are a powerful visualization 
because, no matter how large your data set 
is, they show you the distribution of data 
points across their range of values. 

For example, the table you envisioned in the 
last exercise would have told you how many 
people received raises at about 5 percent. 




This histogram shows graphically how many 
people fall into each raise category, and it 
concisely shows you what people are getting 
across the spectrum of raises. 
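
Under the hood, a histogram is just a table of counts per bin. The sketch below shows that counting step in R with ten invented raise percentages; the numbers are hypothetical, and cut() and table() are used here only to illustrate the idea.

```r
# Hypothetical raise percentages; the real data set has about 3,000 rows.
received <- c(0.4, 1.7, 3.2, 4.1, 4.8, 4.9, 5.3, 5.6, 6.2, 7.9)

# One-point-wide bins from 0 to 8: [0,1), [1,2), ..., [7,8).
bins <- cut(received, breaks = 0:8, right = FALSE)

# Count how many raises fall into each bin.
bin_counts <- table(bins)
bin_counts
```

The height of each bar in a histogram is one of these counts. Here the [4,5) bin holds three raises and the [2,3) bin holds none.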



On the other hand, there are some problems 
with what Excel did for you. The default 
settings for the bins (or “class intervals”) end
up producing messy, noisy x-axis values. The 
graph would be easier to read with plain 
integers (rather than long decimals) on the 
x-axis to represent the bins. 

Sure, you can tweak the settings to get those 
bins looking more like the data table you 
initially envisioned. 


But even this histogram 
has a serious problem. 
Can you spot it? 
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histograms 


Gaps between bars in a histogram
mean gaps among the data points


In histograms, gaps mean that there is data 
missing between certain ranges. If, say, no 
one got a raise between 5.75 percent and 
6.25 percent, there might be a gap. If the 
histogram showed that, it might really be 
worth investigating. 

In fact, there will always be gaps if there are 
more bins than data points (unless your data 
set is the same number repeated over and over).
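
You can check the more-bins-than-data-points claim with a tiny experiment in R. The five raise values below are invented, and plot = FALSE just asks hist to return the counts without drawing anything.

```r
# Five hypothetical, distinct raise percentages.
received <- c(1.2, 3.4, 5.1, 7.8, 9.6)

# Ask for 20 equal-width bins across 0 to 10, far more bins than data points.
h <- hist(received, breaks = seq(0, 10, by = 0.5), plot = FALSE)

# At most 5 bins can hold a value, so at least 15 must be empty.
length(h$counts)
sum(h$counts == 0)
```

Those empty bins are the gaps. They say nothing interesting about the raises; they only show that the bins were cut too finely for so little data.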



Histograms Up
Close


Does this gap mean that there are no people
who got raises between these values?

That’s exactly what the gap should mean, at least if 
the histogram is written correctly. If you assumed this 
histogram was correct, and that there were gaps between 
these values, you’d get the totally wrong idea. You need a 
software tool to create a better histogram. 


These rwo\rc 
lookih^ 
his-tog\rarws. 


~n 



The problem with Excel’s function is 
that it creates these messy, artificial 
breaks that are really deceptive. 

And there’s a technical workaround 
for the issues (with Excel, there’s 
almost always a workaround if you 
have the time to write code using 
Microsoft’s proprietary language). 


But it’s chapter 9, and you’ve been 
kicking serious butt. You’re ready 
for a software tool with more 
power than Excel to manage and 
manipulate statistics. 

The software you need is called R. 
It’s a free, open source program 
that might be the future of 
statistical computing, and you’re 
about to dive into it! 
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crank up a mean tool 


Install and run R 


Head on over to www.r-project.org

to download R. You should have no 
problem finding a mirror near you that 
serves R for Windows, Mac, and Linux. 


Click this download link.


Once you’ve fired up the
program, you’ll see a window
that looks like this.


This little cursor here represents the
command prompt and is where you’ll
enter your commands.



Relax


The command prompt 
is your friend. 


Working from the 
command prompt is 
something you get the hang of quickly, 
even though it requires you to think a little 
harder at first. And you can always pull up a 
spreadsheet-style visualization of your data 
by typing edit(yourdata).
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histograms 




I 

R session so that you can come back to the 
Head First data when you’re not connected to 
the Internet you can type save . image (). 


So what did you download? First, take a look 
at the data frame from your download called 
: Employees.’’Just type this command and press 
Enter: 

Type the o-P -the ds 
-Pv*arr>c -to get R -to display j 


employees 


The output you see on the right is what R 
gives you in response. 



[...] of all the rows in the data.





Generate a histogram in R by typing this command: 

hist(employees$received, breaks=50)


What does this mean?


What do you think the various elements of the command mean? Annotate your response. 


Load data into R 

For your first R command, try loading 
the Head First Data Analysis script using the 
source command: 



this! 


source("http://www.headfirstlabs.com/books/hfda/hfda.R")


That command will load the raise data you
need for R. You'll need to be connected to the
Internet for it to work. If you want to save your







Exercise Solution


What do you think this histogram command means?

hist(employees$received, breaks=50)

hist tells R to run the histogram function.

The first argument specifies what data to use.

The second argument tells R how to construct the groupings.


R creates beautiful histograms 

With histograms, the areas under 
the bars don’t just measure the 
count (or frequency) of the thing 
being measured; they also show the 
percentage of the entire data set being 
represented by individual segments. 
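
If you'd like to poke at the pieces of the hist command without downloading anything, here's a minimal sketch. It assumes a made-up `employees` data frame with a simulated `received` column (the numbers are invented to give two humps, and are not the book's data); only `hist` and its `breaks` argument come from the text.

```r
# Simulated stand-in for the raise data (percent raises); NOT the real data.
set.seed(42)
employees <- data.frame(
  received = c(rnorm(800, mean = 4, sd = 1.5),  # a large hump of smaller raises
               rnorm(200, mean = 11, sd = 2))   # a small hump of bigger raises
)

# First argument: the values to plot. breaks: a suggested number of bins
# (R treats it as a hint and picks tidy cut points near that count).
h <- hist(employees$received, breaks = 50,
          main = "Simulated raises", xlab = "Raise received (%)")

# hist() also hands back the bin counts; together they cover every row.
total_counted <- sum(h$counts)
share_per_bar <- h$counts / total_counted  # fraction of the data in each bar
```

The `share_per_bar` values are one way to see the point above: each bar is a count, and it is also a slice of the whole data set.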


When you run the command, a
window pops up showing this.


Look carefully at the contour 
of the curve. A few things are 
obvious. Not a lot of people got 
raises below 0 percent, and not a 
lot of people got raises above 22 
percent. 

But what’s happening in the middle 
of the distribution? 



A lot of people got
a raise around 5%.


These are the highest
raises.


What do you make
of this histogram?



These commands will tell you a little more about your data set and what people’s 
raises look like. What happens when you run the commands? 



Why do you think R responds to
each of these the way it does?


What do the two commands do? 


Look closely at the histogram. How does what you see on the histogram compare 
with what R tells you from these two commands? 
Exercise Solution


You just ran some commands to illustrate the summary statistics for your data set about raises. 
What do you think these commands did? 


On average, raises are 2.43 [...]


What do the two commands do? 

The sd command returns the standard
deviation of the data you specify,
and the summary() command shows you
summary statistics about the received
column.



[...] raises people received


Look closely at the histogram. How does what you see 
on the histogram compare with what R tells you from 
these two commands? 

The histogram does a good job of visually
showing the mean, median, and standard
deviation. Looking at it, you can't see the
exact figures, but you can get a sense of
those numbers by looking at the shape of the
curve.
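
Here's a small sketch of the two summary commands in action. It runs on simulated two-hump raises (an assumption standing in for `employees$received`, so the numbers it prints are not the book's), and it uses only base R's `summary`, `sd`, `mean`, and `median`.

```r
# Simulated two-hump raises (percent); a stand-in, not the real data.
set.seed(42)
received <- c(rnorm(800, mean = 4, sd = 1.5), rnorm(200, mean = 11, sd = 2))

raise_summary <- summary(received)  # min, quartiles, median, mean, max
raise_sd <- sd(received)            # typical distance of a raise from the mean

print(raise_summary)
print(raise_sd)

# With a small hump off to the right, the mean lands above the median.
mean_above_median <- mean(received) > median(received)
```

Comparing the mean and median against the histogram is a quick reality check: if they sit far apart, the shape is probably lopsided or has more than one hump.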
Sharpen your pencil


Joe: If the histogram were symmetrical, the 
mean and median would be in the same 
place — in the dead center. 

Frank: Right. But in this case, the small 
hump on the right side is pulling the mean 
away from the center of the larger hump, 
where most of the observations are. 

Joe: I’m struggling with those two humps. 
What do they mean? 

Frank: Maybe we should take another look at 
those pieces of data we identified earlier and 
see if they have any relevance to the histogram. 


Joe: Good idea. 



The data groupings you
imagined earlier.


Can you think of any ways that the groups you identified earlier 
might explain the two humps on the histogram? 
Sharpen your pencil
Solution

How might the groupings of data you identified earlier account
for the two humps on your histogram?

There could be variation among years: for example, raises in 2007 could be on average much higher
than raises from 2006. And there could be gender variation, too: men could, on average, get higher raises
than women, or vice versa. Of course, all the data is observational, so any relationships you discover won't
necessarily be as strong as what experimental data would show.


So it seems like we have a lot 
of flexibility when it comes to how the 
histograms look. 

It’s true. You should think of the 
very act of creating a histogram as an 
interpretation, not something you do before 
interpretation. 

Are the defaults that R uses for 
creating a histogram generally good? 

Generally, yes. R tries to figure out the 
number of breaks and the scale that will best 
represent the data, but R doesn't understand 
the meaning of the data it’s plotting. Just 
as with the summary functions, there’s 
nothing wrong with running a quick and dirty 
histogram to see what’s there, but before 
you draw any big conclusions about what 
you see, you need to use the histogram (and 
redraw the histogram) in a way that remains 
mindful of what you're looking at and what 
you hope to gain from your analysis. 
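
To see how much the picture depends on the settings, here's a rough sketch that compares R's default binning with a request for about 50 breaks. The data is simulated (an assumption, not the book's raises), and `plot = FALSE` is used so the bins can be counted without drawing anything.

```r
# Simulated two-hump raises (percent); a stand-in, not the real data.
set.seed(42)
received <- c(rnorm(800, mean = 4, sd = 1.5), rnorm(200, mean = 11, sd = 2))

# plot = FALSE computes the bins without drawing, so they can be compared.
default_bins <- hist(received, plot = FALSE)
fine_bins    <- hist(received, breaks = 50, plot = FALSE)

n_default <- length(default_bins$counts)  # bars R chooses on its own
n_fine    <- length(fine_bins$counts)     # bars when about 50 are requested

print(c(default = n_default, fine = n_fine))
```

Same data, different number of bars. Neither is "the" histogram, which is why redrawing with your question in mind matters.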

Are either of those humps the “bell 
curve?” 



That’s a great question. Usually, 
when we think of bell curves, we’re talking 
about the normal or Gaussian distribution. 
But there are other types of bell-shaped 
distributions, and a lot of other types of 
distributions that aren’t shaped like a bell. 

Then what’s the big deal about the 
normal distribution? 

A lot of powerful and simple statistics
can come into play if your data is normally
distributed, and a lot of natural and business
data follows a normal distribution (or can be
“transformed” in a way that makes it normally
distributed).

So is our data normally 
distributed? 

The histogram you've been evaluating 
is definitely not normally distributed. As long 
as there’s more than one hump, there’s no 
way you can call the distribution bell-shaped. 

But there are definitely two humps 
in the data that look like bells! 


And that shape must have some sort 
of meaning. The question is, why is the 
distribution shaped that way? How will you 
find out? 

Can you draw histograms to 
represent small portions of the data to 
evaluate individually? If we do that, we 
might be able to figure out why there are 
two humps. 

That’s the right intuition. Give it a shot! 


Can you break the raise 
data down in a way that 
isolates the two humps 
and explains why they 
exist? 


Make histograms from 
subsets of your data 

You can make a histogram out of your entire 
data set, but you can also split up the data 
into subsets to make other histograms. 
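
Splitting up the data in R comes down to putting a TRUE/FALSE test inside square brackets. Here's a tiny sketch on a six-row, made-up data frame (the column names mirror the ones used in this chapter, but the values are invented for illustration).

```r
# Six invented rows; column names follow the chapter, values are made up.
employees <- data.frame(
  received   = c(3.1, 12.4, 4.0, 10.8, 5.2, 2.7),
  year       = c(2007, 2007, 2008, 2008, 2007, 2008),
  negotiated = c(FALSE, TRUE, FALSE, TRUE, FALSE, FALSE)
)

# The comparison yields one TRUE/FALSE per row...
in_2007 <- employees$year == 2007

# ...and the brackets keep only the values where it is TRUE.
raises_2007       <- employees$received[in_2007]
raises_negotiated <- employees$received[employees$negotiated == TRUE]

print(raises_2007)
print(raises_negotiated)
```

Whatever you put in the brackets, the result is just a shorter vector, so it can go straight into hist(), summary(), or sd().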


Inside your data are
subsets of data that represent
different groups.


If you plot the raise
values for each subset,
you might get a bunch
of different shapes.



The shape of men's raises, for example,
might tell you something [...]

Let’s make a bunch of histograms that describe subsets of the raise data. Maybe looking at 
these other histograms will help you figure out what the two humps on the raise histogram 
mean. Is there a group of people who are earning more in raises than the rest? 

1) To start, look at this histogram command and annotate its syntax. What do you think its components mean? 

hist(employees$received[employees$year == 2007], breaks = 50) 

Write down what you
think it means.

2) Run the above command and each of these commands. What do you see? The results are on the next page,
where you’ll write down your interpretations.

hist(employees$received[employees$year == 2008], breaks = 50)
hist(employees$received[employees$gender == "F"], breaks = 50)
hist(employees$received[employees$gender == "M"], breaks = 50)
hist(employees$received[employees$negotiated == FALSE], breaks = 50)
hist(employees$received[employees$negotiated == TRUE], breaks = 50)
Exercise


These histograms represent the raises for different subgroups of your employee population. 
What do they tell you? 


hist(employees$received[employees$year == 2007], breaks = 50)

The hist() command makes a histogram.

received is the set of values you want plotted in the histogram.

These brackets are the subset operator, which extracts a subset of your data.

In this case, you're extracting records where the year is 2007.

Breaks are the number of bars in the histogram.


hist(employees$received[employees$year == 2008], 
breaks = 50) 


hist(employees$received[employees$gender == "F"],
breaks = 50)


hist(employees$received[employees$gender == "M"],
breaks = 50)



hist(employees$received[employees$negotiated == FALSE], 
breaks = 50) 



hist(employees$received[employees$negotiated == TRUE], 
breaks = 50) 






Exercise Solution

You looked at the different histograms in search of answers to help you understand who is
getting what raises. What did you see?


hist(employees$received[employees$year == 2007],
breaks = 50)

This selects only raises for 2007 and has the same basic
shape as the original histogram. The scale is different: for example, only 8 people
are in the largest break here. But the shape is the same, and the 2007
group might have the same characteristics as the overall group.


hist(employees$received[employees$year == 2008],
breaks = 50)

There's the exact same thing going on here as we see with the 2007
data. R even chose to plot the data using the exact same scale. At
least as far as this data is concerned, 2007 and 2008 are pretty
similar.


hist(employees$received[employees$gender == "F"],
breaks = 50)

Once again, we see the big hump and the little hump attached on the
right, although the scale is different on this histogram. This
shows raises earned by women for all the years represented in the data,
so there's a lot of them.
hist(employees$received[employees$gender == "M"],
breaks = 50)

This looks a lot like the histogram for females. The scale is different,
but when you count the bars, it looks like there are roughly the same
number of men as women in the different categories. As usual, there are
two humps.


hist(employees$received[employees$negotiated == FALSE],
breaks = 50)

Now here's something interesting: just one hump. And the horizontal
scale shows that these people (the ones who did not negotiate their
raises) are on the low end of the raise range. And there are a lot of
them, as you can see from the vertical scale.


hist(employees$received[employees$negotiated == TRUE],
breaks = 50)

It looks like splitting those who did and did not negotiate neatly
separates the two humps. Here we see people earning a lot more in raises,
and there are far fewer people. It looks like negotiating for a raise
gives people a completely different outcome distribution.


Negotiation pays 


Your analysis of histograms of different subsets of 
the raise data shows that getting a larger raise is all 
about negotiation. 

People have a different spread of outcomes 
depending on their choice of whether to negotiate. If 
they do, their whole histogram shifts to the right. 


If you run the summary statistics on your negotiation 
subsets, the results are just as dramatic as what you 
see with the two curves. 
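
Here's one way the subset summaries might look in practice. The data below is simulated (an assumption: 800 non-negotiators with smaller raises and 200 negotiators with bigger ones), so the printed figures are illustrative only. The `tapply` call at the end is an addition that isn't used in the chapter; it just computes both group means in one go.

```r
# Simulated stand-in for the raise data; NOT the real figures.
set.seed(42)
employees <- data.frame(
  received   = c(rnorm(800, mean = 4, sd = 1.5), rnorm(200, mean = 11, sd = 2)),
  negotiated = rep(c(FALSE, TRUE), times = c(800, 200))
)

# Summary statistics for each subset, using the bracket-subsetting idiom.
print(summary(employees$received[employees$negotiated == TRUE]))
print(sd(employees$received[employees$negotiated == TRUE]))
print(summary(employees$received[employees$negotiated == FALSE]))
print(sd(employees$received[employees$negotiated == FALSE]))

# Shortcut: the mean of received within each negotiated group.
group_means <- tapply(employees$received, employees$negotiated, mean)
print(group_means)
```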



Non-negotiators tend
to get lower raises.


This is the function that calculates
the standard deviation.


The mean and median are about
the same within each distribution.


summary(employees$raise_amount[employees$negotiated == TRUE])
sd(employees$raise_amount[employees$negotiated == TRUE])
summary(employees$raise_amount[employees$negotiated == FALSE])
sd(employees$raise_amount[employees$negotiated == FALSE])

[R console screenshot: the output values for these four commands are not legible in this copy.]


[...] definitely
negotiate your salary.


What will negotiation mean for you?


Now that you’ve analyzed the raise data, it 
should be pretty clear which strategies will 
have the best results. 




These are your strategies:


Ask for a little


Ask for a lot


Do nothing


It's OK to do nothing...
if you don't want a big raise!


The data suggest that negotiation
will tend to create these outcomes.
10 regression


Prediction



I've got all this data, but
I really need it to tell me 
what will happen in the 
future. How can I do that? 


Predict it. 

Regression is an incredibly powerful statistical tool that, when used correctly, has the 
ability to help you predict certain values. When used with a controlled experiment, 
regression can actually help you predict the future. Businesses use it like crazy to help 
them build models to explain customer behavior. You’re about to see that the judicious 
use of regression can be very profitable indeed. 




Raises for people who did negotiate


What are you going to
do with all this money? 

Your quest for a raise paid off. With your 
histograms, you figured out that people who 
chose to negotiate their salaries got consistently 
higher outcomes. So when you went into your 
boss’s office, you had the confidence that you 
were pursuing a strategy that tended to pay off, 
and it did! 

These are the histograms you looked at in the 
final exercises of the previous chapter, except 
they’ve been redrawn to show the same scale 
and bin size. 

Nice work! 


Your boss was
impressed; you [...]





No point in stopping now. 

Lots of people could benefit from your insight about how 
to get better raises. Few of your colleagues took the savvy 
approach you did, and you have a lot to offer those who
didn’t. 

You should set up a business that specializes in 
getting people raises! 


Sharpen your pencil 


Here are a few questions to get you thinking about data-based 
ways of creating a business around your insights in salary 
negotiations. 


What do you think your clients would want from a business that 
helps them understand how to negotiate raises? 


If you ran such a business, what would be a fair way to 
compensate you for your knowledge? 
Sharpen your pencil
Solution


What sort of data-based compensation consulting business do 
you envision? 


What do you think your clients would want from a business that 
helps them understand how to negotiate raises? 

There are all sorts of ways people negotiating for a raise could be helped: they might want
to know how to dress, how to think about the issue from the perspective of their boss, what words
will soften people up, and so forth. But one question is: how much do I ask for?


If you ran such a business, what would be a fair way to 
compensate you for your knowledge? 

Clients will want you to have an incentive to make sure that their experience works out well. So
why not charge them a percentage of what they actually get when they use your advice? That way,
your incentive is to get them the biggest raise you can get them, not to waste their time.


Your client needs you to help her figure
out what sort of raise to ask for.


When she asks her boss for a given
raise level, her boss will respond with a raise.


Your slice of the raise.


An analysis that tells people
what to ask for could be huge 

What amount of money is reasonable to ask 
for? How will a request for a raise translate 
into an actual raise? Most people just don’t 
know. 



You need a basic outline of your service so you 
know what you’re shooting for. What will your 
product look like? 
Behold... the Raise Reckoner!

People want to know what to ask for. And they 
want to know what they’ll get, given what they’ve 
asked for. 

You need an algorithm. 


And you’ve got everything you need to create 
a kick-butt decision procedure to help people 
get good raises. 


The algorithm is some sort of decision
procedure that says what will happen
at different request levels.


This is your actual product.
People will pay
you for this.


Scholar’s Corner

Algorithm: Any procedure you follow to complete a
calculation. Here, you'll take the input to the algorithm, the
Amount Requested, and perform some steps in order
to predict the Amount Rewarded. But what steps?


You know what
people received, too.


What happens inside the algorithm? 

It’s all well and good to draw a pretty picture like this, but in order for
you to have something that people are willing to pay for (and, just
as important, something that works), you’re
going to need to do a serious analysis.

So what do you think goes inside? 


Inside the algorithm will be 
a method to predict raises 


Prediction is a big deal for data analysis. 
Some would argue that, speaking generally, 

hypothesis testing and prediction 

together are the definition of data analysis. 



[...] are insatiable!




BULLET POINTS 

Things you might need to predict: 

■ People’s actions 

■ Market movements 

■ Important events 

■ Experimental results 

■ Stuff that’s not in your data 


Questions you should always ask: 

■ Do I have enough data to predict? 

■ How good is my prediction? 

■ Is it qualitative or quantitative? 

■ Is my client using the prediction well? 

■ What are the limits of my prediction? 


Let’s take a look at some data 

about what negotiators asked for. Can 
you predict what sort of raise you’ll 
get at various levels of requests? 




Sharpen your pencil 


The histograms below describe the amount of money the 
negotiators received and the amount of money they requested. 


Do the histograms tell you what people should request in order to get a big raise? Explain how comparing the 
two histograms might illuminate the relationship between these two variables, so that you might be able to 
predict how much you would receive for any given request. 


[Figure: two histograms for negotiators, one of the amount received and one of the amount requested. Vertical axis: Number of people.]


And a big request
could get a small raise!


Sharpen your pencil 


Can you tell from looking at these two histograms how much 
someone should request in order to get the biggest raise? 


No. The histograms show spreads of single variables, but they don't actually compare them. In order
to know how these two variables relate to each other, we'd have to see where single individuals fall
on both the requested and received distributions.


A small request could get a big raise...


Requests made by negotiators


Or the relationship could be different. Because requested
and received aren't plotted together, you just don't know.


there are no
Dumb Questions


Can’t I just overlay two histograms 
onto the same grid? 

You totally can. But in order to make a 
good comparison, the two histograms need 
to describe the same thing. You made a 
bunch of histograms in the previous chapter 
using subsets of the same data, for example, 
and comparing those subsets to each other 
made sense. 


But Amount Received and Amount 
Requested are really similar, aren’t they? 

Sure, they’re similar in the sense that 
they are measured using the same metric: 
percentage points of one’s salary. But 
what you want to know is not so much the 
distribution of either variable but how, for a 
single person, one variable relates to the 
other. 


I get it. So once we have that 
information, how will we make use of it? 

Good question. You should stay 
focused on the end result of your analysis, 
which is some sort of intellectual “product” 
that you can sell to your customers. What 
do you need? What will the product look 
like? But first, you need a visualization that 
compares these two variables. 



oscg 





osl\ 02 

f09dM—0 J9qEnN 
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regression 



Scatterplot Magnets 


Remember scatterplots from chapter 4? They’re 
a great visualization for looking at two variables 
together. In this exercise, take the data from 
these three people and use it to place them on 
the graph. 


You’ll need to use other magnets to draw your 
scale and your axis labels. 


Bob requested 5% and received 5%. 
Fannie requested 10% and received 8%. 
Julia requested 2% and received 10%. 





Use this x-y
axis to plot Bob,
Fannie, and Julia.




Scatterplot Magnets 


You just plotted Bob, Fannie, and Julia on the axis to create a 
scatterplot. What did you find? 


Bob requested 5% and received 5%. 
Fannie requested 10% and received 8%. 
Julia requested 2% and received 10%. 
there are no
Dumb Questions


When can I use scatterplots? 

Try to use them as frequently as 
you can. They’re a quick way to show rich 
patterns in your data. Any time you have 
data with observations of two variables, you 
should think about using a scatterplot. 

So any two variables can be put 
together in a scatterplot? 

As long as the two variables are in 
pairs that describe the same underlying 
thing or person. In this case, each line of 
our database represents an instance of 
an employee asking for a raise, and for 
each employee, we have a received and a 
requested value. 


What should I look for when I see 

them? 

For an analyst, scatterplots are 
ultimately all about looking for causal 
relationships between variables. If high 
requests cause low raises, for example, 
you’ll see an association between the two 
variables on the scatterplot. The scatterplot 
by itself only shows association, and to 
demonstrate causation you’ll need more (for 
starters, you’d need an explanation of why 
one variable might follow from the other). 

What if I want to compare three 
pieces of data? 

You can totally create visualizations 


in R that make a comparison among more 
than two variables. For this chapter, we're 
going to stick with two, but you can plot 
three variables using 3D scatterplots and 
multi-panel lattice visualizations. If you’d 
like a taste of multidimensional scatterplots, 
copy and run some of the examples of the
cloud function (part of the lattice package, so load it
first with library(lattice)) that can be found in the
help file at help(cloud).

So when do we get to look at the 
2D scatterplot for the raise data? 

Right now. Here's some ready 
bake code that will grab some new, more 
detailed data for you and give you a handy 
scatterplot. Go for it! 




Run these commands inside of R to generate 

a scatterplot that shows what people 
requested and what they received. 


Make sure you're connected to the
Internet when you run this command,
because it pulls data off the Web.


employees <- read.csv("http://www.headfirstlabs.com/books/hfda/hfda_ch10_employees.csv", header=TRUE)

This loads some new data and doesn't
display any results.


head(employees, n=30)

This command will show you what's in the
data... always a good idea to take a look.


plot(employees$requested[employees$negotiated==TRUE],
employees$received[employees$negotiated==TRUE])

This displays the scatterplot.


What happens when you
run these commands?

Scatterplots compare two variables 


Each one of the points on this scatterplot 
represents a single observation: a single 
person. 

Like histograms, scatterplots are another 
quick and elegant way to show data, and 
they show the spread of data. But unlike 
histograms, scatterplots show two variables. 
Scatterplots show how the observations are 
paired to each other, and a good scatterplot 
can be part of how you demonstrate 
causes. 
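
For a scatterplot small enough to check by eye, here's a sketch that plots the three people from the magnets exercise (Bob, Fannie, and Julia). The `friends` data frame and its column names are just a convenient way to hold those three pairs; they aren't part of the downloaded data.

```r
# The three requested/received pairs given in the magnets exercise.
friends <- data.frame(
  name      = c("Bob", "Fannie", "Julia"),
  requested = c(5, 10, 2),
  received  = c(5, 8, 10)
)

# One dot per person: x is what they asked for, y is what they got.
plot(friends$requested, friends$received,
     xlim = c(0, 12), ylim = c(0, 12),
     xlab = "Amount requested (%)", ylab = "Amount received (%)")

# Label each dot so you can see who is who.
text(friends$requested, friends$received, labels = friends$name, pos = 3)
```

Each row of the data frame becomes exactly one dot, which is the "pairs that describe the same person" idea in action.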



This dude asked for
7% but got 20%. He
must be important.


The plot command produced
the scatterplot.


head(employees, n=30)

plot(employees$requested[employees$negotiated==TRUE],
employees$received[employees$negotiated==TRUE])


This one asked for [...] and
received [...].


The head command shows
you the data below.


Here's the output of
the head command.


This guy asked for 12%
but had a pay cut!


The head command is a
quick way to peek inside
any new data you load.


Together, these
dots represent all
the negotiators
in the database.



Of course you can, but why would you? 
Remember, you’re trying to build an 
algorithm here. 

What would a line through 
the data do for you? 









A line through the data could be a really 
powerful way to predict. Take another look at 
the algorithm you’ve been thinking about. 



A line could be this piece in the middle. If you 
had a line, you could take a Request value 
and then figure out the point on the line that 
matches a Received value. 

If it was the right line, you might have your 
missing piece of the algorithm. 



The raise Reckoner 

Tell me what ye shall ask, 
and I tell ye 
what ye shall receive. 



Is this the best line for prediction?

A line could tell your
clients where to aim.


Hint: look at the dots in
the 8% requested range!


Sharpen your pencil 


In order to figure out how to get the right line, why not just try 
answering a specific question about a single raise with your 
scatterplot? Here’s an example: 


If someone asks for an 8 percent raise, what is he likely to receive in return? See if
looking at this scatterplot can tell you what sort of results people got from asking 
for 8 percent. 


Take a good look at
the scatterplot to answer
this question.


[Scatterplot of negotiators. Horizontal axis: employees$requested[employees$negotiated == TRUE]. Vertical axis: employees$received[employees$negotiated == TRUE].]
If you take the mean of the Amount Received 
scores for dots in the 8 percent range (or strip), you 
get around 8 percent. On average, if you ask for 8 
percent, you get 8 percent. 
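
The strip idea can be sketched in a couple of lines of R. The data here is simulated under a loud assumption: requests sit on a grid from 2% to 20%, and on average people receive what they ask for, plus noise. That mirrors the story in the text but is not the book's data, so the printed mean is only illustrative.

```r
# Simulated negotiators; ASSUMES received is about equal to requested, plus noise.
set.seed(7)
requested <- rep(seq(2, 20, by = 0.5), each = 4)
received  <- requested + rnorm(length(requested), sd = 1)

# The strip: everyone whose request falls between 7.5% and 8.5%.
in_strip <- requested >= 7.5 & requested <= 8.5

# The prediction for an 8% request is the mean raise inside that strip.
strip_mean <- mean(received[in_strip])
print(strip_mean)
```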


So you’ve solved the raise question for one group of 
people: those who ask for 8 percent. But other people 
will ask for different amounts. 


This is the y-axis
value for receiving
an 8% raise.


Here is the employee
who’s asking for 8%.


This strip is dots that have x-axis
values of between 7.5% and 8.5%.


What happens if 
you look at the 
average amount 
received for all 
the x-axis strips? 




Sharpen your pencil
Solution


Using the scatterplot, how do you determine what an 8% raise
request is likely to get you?


Just take the average amount received for dots around the range of amount requested you’re
looking at. If you look around 8% on the x-axis (the amount requested), it looks like the
corresponding dots on the y-axis are about 8%, too. Take a look at the graph below.


Predict values in each strip
with the graph of averages 

The graph of averages is a scatterplot that 
shows the predicted y-axis value for each strip 
on the x-axis. This graph of averages shows 
us what people get, on average, when they 
request each different level of raise. 

The graph of averages is a lot more powerful 
than just taking the overall average. The overall 
average raise amount, as you know, is 4 percent. 
But this graph shows you a much more subtle 
representation of how it all shakes out. 
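
A graph of averages can be roughed out with `cut` (to assign each request to a strip) and `tapply` (to average the raises in each strip). The data is simulated on the assumption that people receive roughly what they request; the 2-point strip width is also an arbitrary choice made for this sketch.

```r
# Simulated negotiators; ASSUMES received is about equal to requested, plus noise.
set.seed(7)
requested <- rep(seq(2, 20, by = 0.5), each = 4)
received  <- requested + rnorm(length(requested), sd = 1)

# Assign each request to a 2-point-wide strip: (0,2], (2,4], ..., (18,20].
strip <- cut(requested, breaks = seq(0, 20, by = 2))

# Average raise received within each strip.
avg_received <- tapply(received, strip, mean)
strip_mid <- seq(1, 19, by = 2)  # the middle of each strip, for plotting

# The graph of averages: one point per strip instead of one per person.
plot(strip_mid, as.numeric(avg_received),
     xlab = "Amount requested (%), strip midpoint",
     ylab = "Average amount received (%)")
```

Ten points replace 148 dots, and each one is a prediction for people whose request falls in that strip.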


Here's the point created
to predict the likely value
of the raise.



You’ve hit on the right line. 

Seriously. Draw a line through the 
points on the graph of averages. 

Because that line is the one you’re 
looking for, the line that you can use 

to predict raises for everybody. 



The regression line predicts 
what raises people will receive 

Here you have it, the fascinating regression 
line. 

The regression line is just the line that best 
fits the points on the graph of averages. 

As you’re about to see, you don’t just have to 
draw them on your graphs. 

You can represent them with a simple equation 
that will allow you to predict the y variable for 
any x variable in your range. 
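
As a hedged preview of what such an equation looks like in R: the sketch below fits a straight line with `lm`, which uses least squares on the individual points (the standard way to get a regression line, rather than literally drawing through the graph of averages). The data and the "true" relationship of 1 + 0.8 × requested are invented for illustration.

```r
# Simulated negotiators; the 1 + 0.8 * requested relationship is made up.
set.seed(7)
negotiators <- data.frame(requested = rep(seq(2, 20, by = 0.5), each = 4))
negotiators$received <- 1 + 0.8 * negotiators$requested +
  rnorm(nrow(negotiators), sd = 1)

# Fit the line: received = intercept + slope * requested.
fit <- lm(received ~ requested, data = negotiators)
intercept <- coef(fit)[["(Intercept)"]]
slope     <- coef(fit)[["requested"]]

# Predict the raise for an 8% request, two equivalent ways.
predicted <- predict(fit, newdata = data.frame(requested = 8))
by_hand   <- intercept + slope * 8
print(c(intercept = intercept, slope = slope, prediction = by_hand))
```

The `by_hand` line is the whole point: once you have an intercept and a slope, predicting is just arithmetic.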


there are no
Dumb Questions


Why is it called a regression? 

The guy who discovered the method,
Sir Francis Galton (1822-1911), was
studying how the height of fathers predicted
the height of their sons. His data showed
that, on average, the sons of short fathers were
taller than their fathers, and the sons of tall
fathers were shorter than theirs. He
called this phenomenon “regression to
mediocrity.”

Sounds kind of snooty and elitist. 
It seems that the word “regression” has 
more to do with how Galton felt about 
numbers on boys and their dads than 
anything statistical. 


That’s right. The word “regression” is 
more a historical artifact than something 
analytically illuminating. 

We’ve been predicting raise 
amount from raise request. Can I predict 
raise request from raise amount? Can I 
predict the x-axis from the y-axis? 

Sure, but in that case, you’d be 
predicting the value of a past event. If 
someone came to you with a raise she 
received, you’d predict the raise she had 
requested. What’s important is that you 
always do a reality check and make sure 
you keep track of the meaning of whatever it 
is that you’re studying. Does the prediction 
make sense? 


Would I use the same line to 
predict the x-axis from the y-axis? 

Nope. There are two regression lines, 
one for x given y and one for y given x. Think 
about it. There are two different graphs of 
averages: one for each of the two variables. 

Does the line have to be straight? 

It doesn’t have to be straight, as long 
as the regression makes sense. Nonlinear 
regression is a cool field that’s a lot more 
complicated and is beyond the scope of this 
book. 
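
The "two regression lines" answer above can be checked numerically. This sketch fits both directions with `lm` on simulated data (an assumption: received is about equal to requested, with a fair amount of noise) and shows that the second line is not simply the first one flipped around.

```r
# Simulated negotiators with noticeable scatter; NOT the real data.
set.seed(7)
requested <- rep(seq(2, 20, by = 0.5), each = 4)
received  <- requested + rnorm(length(requested), sd = 3)

fit_y_from_x <- lm(received ~ requested)   # predict received from requested
fit_x_from_y <- lm(requested ~ received)   # predict requested from received

slope_y_from_x <- coef(fit_y_from_x)[["requested"]]
slope_x_from_y <- coef(fit_x_from_y)[["received"]]

# If both fits were the same line, this slope would be 1 / slope_y_from_x.
same_line <- isTRUE(all.equal(slope_x_from_y, 1 / slope_y_from_x))

# Instead, the two slopes multiply to the squared correlation coefficient.
slope_product <- slope_y_from_x * slope_x_from_y
print(c(same_line = same_line, slope_product = slope_product))
```

The more scatter there is in the data, the further apart the two lines end up.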


【 LLI ncrl p Q)lE! oo)au$s(l)9AoldE alp <I) >!90 <l) J$s <l) aAoldE <I) 
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You're forgetting something. 
Are you sure the line is 
actually useful? I mean, 
what's it doing for ya?


Make sure your line is actually useful. 

There are a lot of different ways the scatterplot 
could look, and a lot of different regression lines. 

The question is how useful the line in your scatterplot is.
Here are a few different scatterplots. Are the lines for each 
one going to be about as useful as the lines for any other? 
Or do some regression lines seem more powerful? 
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is your correlation linear? 


The line is useful if your data 
shows a linear correlation 

A correlation is a linear association between 
two variables, and for an association to be 
linear, the scatterplot points need to roughly 
follow a line. 



You can have strong or weak correlations, 
and they’re measured by a correlation 
coefficient, which is also known as r (not 
to be confused with [big] R, the software 
program). In order for your regression line 
to be useful, data must show a strong linear 
correlation. 

r ranges from -1 to 1, where 0 means
no linear association and 1 or -1 means a perfect
linear association between the two variables.



These dots are all over the place.





Does your raise
data show a linear
correlation?
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regression 


employees$requested[employees$negotiated == TRUE] 





Try using R (the program) to calculate r (the correlation
coefficient) on your raise data. Type and execute this function:


cor(employees$requested[employees$negotiated==TRUE],
    employees$received[employees$negotiated==TRUE])


Annotate the elements of the function. What do you think they mean? 


How does the output of the correlation function square with your scatterplot? Does the 
value match what you believe the association between these two variables to be? 
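If you want to try the cor call without the book's data file at hand, here is a minimal sketch. The employees data frame below is synthetic and made up purely for illustration (its column names match the ones used in the text, but its values are an assumption), so the r you get will not match the book's.

```r
# Synthetic stand-in for the employees data; the values are invented.
set.seed(42)
n <- 200
employees <- data.frame(
  requested  = runif(n, min = 0, max = 22),
  negotiated = rep(c(TRUE, FALSE), length.out = n)
)
employees$received <- 2.3 + 0.7 * employees$requested + rnorm(n, sd = 2)

# Logical index: TRUE for the rows where the employee negotiated.
negotiators <- employees$negotiated == TRUE

# cor() takes two numeric vectors of equal length and returns r.
r <- cor(employees$requested[negotiators],
         employees$received[negotiators])
r
```

The square brackets pick out only the negotiators' rows, so both vectors handed to cor have the same length and line up person by person.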



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zooming in on correlation 



You just told R to give you the correlation coefficient of your two variables.

What did you learn?

These are the two variables you
want R to test for correlation.

The cor function tells R
to return the correlation
of the two variables. You can see a linear association
by looking at the chart.



How does the output of the correlation function square with your scatterplot?

Both the r-value and the scatterplot show a moderate correlation. It's not
perfect, where all the points line up, but there's definitely a linear association.



How do you get the correlation coefficient? The
actual calculation to get the correlation coefficient
is simple but tedious.

Here’s one of the algorithms that can be used to 
calculate the correlation coefficient: 


Correlation Up Close



Standard units show how
many standard deviations
each value is from the mean.
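A minimal sketch of one such algorithm, assuming the standard-units approach: convert each variable to standard units, multiply the paired values, and average the products. The x and y vectors are synthetic, and dividing by n - 1 is a choice made here so that the result agrees with R's sd, which also uses n - 1.

```r
# Synthetic data, invented for illustration only.
set.seed(1)
x <- runif(50, min = 0, max = 22)
y <- 2.3 + 0.7 * x + rnorm(50, sd = 2)

# Step 1: convert each variable to standard units.
zx <- (x - mean(x)) / sd(x)
zy <- (y - mean(y)) / sd(y)

# Step 2: multiply the paired standard units.
products <- zx * zy

# Step 3: average the products (n - 1 matches the divisor sd() uses).
r_by_hand <- sum(products) / (length(x) - 1)

# Compare with R's built-in calculation.
all.equal(r_by_hand, cor(x, y))
```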
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regression 


There are no Dumb Questions


I can see that a correlation of 1 
or -1 is strong enough to enable me to 
use a regression line. But how low of a 
correlation is too low? 

You just need to use your best 
judgment on the context. When you use 
the regression line, your judgments should 
always be qualified by the correlation 
coefficient. 

But how will I know how low of a 
correlation coefficient is too low? 

As in all questions in statistics and data 
analysis, think about whether the regression 
makes sense. No statistical tool will get you 
the precisely correct answer all the time, but 
if you use those tools well, you will know 
how close they will get you on average. Use 


your best judgment to ask, "Is this correlation 
coefficient large enough to justify decisions I 
make from the regression line?” 

How can I tell for sure whether my 
distribution is linear? 

You should know that there are fancy 
statistical tools you can use to quantify the 
linearity of your scatterplot. But usually 
you’re safe eyeballing it. 

If I show a linear relationship 
between two things, am I proving 
scientifically that relationship? 

Probably not. You’re specifying a 
relationship in a really useful mathematical 
sense, but whether that relationship couldn't 
be otherwise is a different matter. Is your 
data quality really high? Have other people 
replicated your results over and over again? 


Do you have a strong qualitative theory 
to explain what you’re seeing? If these 
elements are all in place, you can say you’ve 
demonstrated something in a rigorous 
analytic way, but “proof” might be too strong 
a word. 

How many records will fit onto a 
scatterplot? 

Like the histogram, a scatterplot is 
a really high-resolution display. With the 
right formatting, you can fit thousands and 
thousands of dots on it. The high-res nature 
of the scatterplot is one of its virtues. 



OK, OK, the regression 
line is useful. But here's 
a question: how do I use
it? I want to calculate 
specific raises precisely. 


You’re going to need a 

mathematical function in order 
to get your predictions precise... 
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equation for precision 



x is the x-axis value, which in
this case is the thing we know:
raise amount requested.



employees$requested[employees$negotiated == TRUE] 


You need an equation to make
your predictions precise

Straight lines can be described algebraically 
using the linear equation. 


y = a + bx


y is the y-axis value, which in
this case is the thing we're
trying to predict: raise received.



Your regression line can be represented 
by this linear equation. If you knew what 
yours was, you’d be able to plug any raise 
request you like into the x variable and get a 
prediction of what raise that request would 
elicit. 

You just need to find the numerical values 
for a and b, which are values called the 

coefficients. 


a represents the y-axis intercept 

The first term on the right side of
the linear equation represents the y-axis
intercept, where your line crosses the y-axis.


Here's the y-axis intercept.




If you happen to have dots on your 
scatterplot that are around x=0, you can just 
find the point of averages for that strip. We’re 
not so lucky, so finding the intercept might be 
a little trickier. 




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regression 


b represents the slope 

The slope of a line is a measure of its angle. 
A line with a steep slope will have a large b 
value, and one with a relatively flat slope will 
have a b value close to zero. To calculate 
slope, measure how quickly a line rises (its
“rise,” or change in y-value) for every unit on
the x-axis (its “run”).


slope = rise / run = b


Once you know the slope and y-axis intercept 
of a line, you can easily fill those values into 
your linear equation to get your line. 
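As a small worked sketch of rise over run, take two points on a made-up line. The points (2, 2) and (6, 4) are hypothetical and have nothing to do with the raise data; they just show how the slope and intercept fall out of the arithmetic.

```r
# Two hypothetical points on a straight line.
x1 <- 2; y1 <- 2
x2 <- 6; y2 <- 4

rise <- y2 - y1      # change in y-value
run  <- x2 - x1      # change in x-value
b <- rise / run      # slope

# Rearranging y = a + bx for one known point gives the intercept.
a <- y1 - b * x1

c(intercept = a, slope = b)
```

With those two numbers in hand, the line is y = 1 + 0.5x.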



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objectifying regression 


Tell R to create a regression object 

If you give R the variable you want to predict 
on the basis of another variable, R will 
generate a regression for you in a snap. 


Behind 
the Scenes 



The basic function to use for this is lm, which 
stands for linear model. When you create a 
linear model, R creates an object in memory 
that has a long list of properties, and among 
those properties are your coefficients for the 
regression equation. 


Here's a list of all the

properties R creates

inside your linear model.
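To see those properties for yourself, you can ask R for the names stored in the model object. This is a sketch on synthetic data (the employees values below are invented), and it uses lm's subset argument rather than bracket indexing inside the formula, which is a stylistic choice made here so the coefficient names come out short.

```r
# Synthetic stand-in for the employees data; the values are invented.
set.seed(7)
employees <- data.frame(
  requested  = runif(100, min = 0, max = 22),
  negotiated = rep(c(TRUE, FALSE), length.out = 100)
)
employees$received <- 2.3 + 0.7 * employees$requested + rnorm(100, sd = 2)

# Fit the linear model on the negotiators only.
myLm <- lm(received ~ requested, data = employees,
           subset = negotiated == TRUE)

# List the properties R stored inside the model object.
names(myLm)

# The coefficients are one of those properties.
myLm$coefficients
```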
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Watch it! 


No software can tell you 
whether your regression 
makes sense. 


R and your spreadsheet program 
can generate regressions like 
nobody’s business, but it’s up to you to make 
sure that it makes sense to try to predict one 
variable from another. It’s easy to create 
useless, meaningless regressions. 


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ExGRciSe 


Try creating your linear regression inside of R. 



Run the formulas that create a linear model to describe your data and display the 
coefficients of the regression line. 


myLm <- lm(received[negotiated==TRUE] ~ requested[negotiated==TRUE],
           data=employees)

myLm$coefficients 



Using the numerical coefficients that R finds for you, write the 
regression equation for your data. 


y = a + bx 


Here's the slope.











find the slope 



What formula did you create using the coefficients that R calculated? 


Exercise Solution


Run the formulas that create 
a linear model to describe 
your data and display the 
coefficients of the regression 
line. 


Using the coefficients that R found, you can write your
regression equation like this.


y = 2.3 + 0.7x


This is your regression formula! Here y is the raise received,
2.3 is the intercept, x is the raise requested, and 0.7 is
the slope.


Geek Bits

How did R calculate the slope? It turns out that the slope of the regression line is equal to 
the correlation coefficient multiplied by the standard deviation of y divided by the standard 
deviation of x. 


b = r * (σ_y / σ_x)

This equation
calculates the slope
of the regression line.




Ugh. Let’s just say that calculating the slope of a regression line is one of those tasks that 
should make us all happy we have computers to do our dirty work. These are pretty elaborate 
calculations. But what’s important to remember is this: 

As long as you can see a solid association between your two variables, and as long as 
your regression makes sense, you can trust your software to deal with the coefficients. 
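If you'd like to check the slope formula against what lm reports, here is a minimal sketch on synthetic data (the x and y values are invented). The intercept line uses the fact that a least-squares line passes through the point of averages, which is an addition here and not something spelled out in the text above.

```r
# Synthetic data, invented for illustration only.
set.seed(3)
x <- runif(80, min = 0, max = 22)
y <- 2.3 + 0.7 * x + rnorm(80, sd = 2)

# Slope: correlation coefficient times sd of y over sd of x.
b_formula <- cor(x, y) * sd(y) / sd(x)

# Intercept: the line passes through (mean(x), mean(y)).
a_formula <- mean(y) - b_formula * mean(x)

# Let R do the same job with lm.
fit <- lm(y ~ x)

# Both pairs of numbers should agree.
rbind(by_formula = c(a_formula, b_formula),
      by_lm      = unname(coef(fit)))
```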
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regression 


The regression equation goes hand
in hand with your scatterplot


Take the example of the person who wanted 
to know what sort of raise he’d receive if he 
asked for 8 percent. A few pages back, you 
made a prediction just by looking at the 
scatterplot and the vertical strip around 8 
percent on the x-axis. 

Here's the guy who might ask for 8%.


The regression equation you found with the
help of the lm function gives you the same 
result. 


y = 2.3 + 0.7x
  = 2.3 + 0.7 * 8
  = 7.9



Here's what the regression
equation predicts he'll receive.



So what is the Raise Reckoner? 


You’ve done a lot of neat work crafting a 
regression of the raise data. Does your regression 
equation help you create a product that will 
provide crafty compensation consulting for your
friends and colleagues? 



You still haven't filled in this
part of your algorithm.
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raise reckoner complete 


The regression equation is 
the Raise Reckoner algorithm 

By taking a hard look at how people in the 
past have fared at different negotiation levels 
for their salaries, you identified a regression 
equation that can be used to predict raises 
given a certain level of request.


Your clients will use
this equation to calculate
their raise.


Your clients
will get raises in line
with or greater than
your predictions.

This equation will be really valuable to people 
who are stumped about how much of a raise 
to request. It’s a solid, data-based analysis of 
other people’s success in getting more money 
from their employers. 

Using it is a matter of simple arithmetic in 
R. Say you wanted to find out what sort of 
raise can be expected from a 5 percent request. 
Here’s the code. 
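A minimal sketch of that arithmetic, using the variable name my_raise and the coefficients 2.3 and 0.7 from the regression equation; the name expected is introduced here only for illustration.

```r
# The raise request, in percent.
my_raise <- 5

# Run the request through the regression equation y = 2.3 + 0.7x.
expected <- 2.3 + 0.7 * my_raise
expected
# [1] 5.8
```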


Set the variable
my_raise to 5...


...run my_raise through
your regression equation...



...and here you have it!
The expected raise from a 5%
request is 5.8%.
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regression 


There are no Dumb Questions


How do I know that what people
ask for tomorrow will be like what they
received today?

That’s one of the big questions in
regression analysis. Not only “Will tomorrow
be like today?” but "What happens to my
business if tomorrow is different?" The
answer is that you don’t know whether
tomorrow will be like today. It always might
be different, and sometimes completely
different. The likelihood of change and its
implications depend on your problem domain.

How so?

Well, compare medical data versus
consumer preferences. How likely is it that
the human body, tomorrow, will suddenly
change the way it works? It's possible,
especially if the environment changes
in a big way, but unlikely. How likely is it
that consumer preferences will change
tomorrow? You can bet that consumer
preferences will change, in a big way.

So why bother even trying to
predict behavior?

In the online world, for example, a
good regression analysis can be very
profitable for a period of time, even if it stops
producing good predictions tomorrow. Think
about your own behaviors. To an online
bookseller, you’re just a set of data points.

That’s kind of depressing.

Not really. It means the bookseller
knows how to get you what you want. You’re
a set of data points that the bookseller runs
a regression on to predict which books
you'll want to buy. And that prediction will
work until your tastes change. When they
do, and you start buying different books, the
bookseller will run the regression again to
accommodate the new information.

So when the world changes and
the regression doesn’t work any more, I
should update it?


Again, it depends on your problem 
domain. If you have good qualitative reasons 
to believe that your regression is accurate, 
you might never have to change it. But if 
your data is constantly changing, you should 
be running regressions constantly and using 
them in a way that enables you to benefit if 
the regressions are correct but that doesn’t 
destroy your business if reality changes and 
the regressions fail. 

Shouldn’t people ask for the raise 
they think they deserve rather than the 
raise they see other people getting? 

That’s an excellent question. The 
question is really part of your mental model, 
and statistics won't tell you whether what 
you’re doing is the right or fair approach. 
That’s a qualitative question that you, the 
analyst, need to use your best judgment in 
evaluating. (But the short answer is: you 
deserve a big raise!) 






Meet your first clients! Write down what sort of raise you think is appropriate for them to request, 
given their feelings about asking, and use R to calculate what they can expect. 
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you're in business 






Exercise Solution


What did you recommend to your first two clients, and 
what did R calculate their expected raises to be? 



Why not ask for 3%? That's on
the low end of the scale.


A low raise request might be 3%.


Someone who asked for 3% would
expect to get around 4.4% in return.


I'm ready to go all out. I
want a double-digit raise!


You might have picked different
numbers than these.


A more aggressive raise request
would be for 15%.


A higher raise request might
be 15%.


> person2 <- 15
> 2.3 + 0.7*person2
[1] 12.8


Someone who asked for 15% would
expect to get around 12.8% in return.


Let’s see what happened.
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regression 


Your raise predictor didn't 
work out as planned... 


People were falling all over themselves to get 
your advice, and you got off your first round 
of recommendations smoothly. 


But then the phone started ringing. Some 
of your clients were pleased as punch about 
the results, but others were not so happy! 


I got 5%! I'm definitely
satisfied. Good for you.
The check's in the mail!


Looks like this one
did just fine!


12.8%? Man, I got 0.0%.
Hope you know a good
lawyer!


This didn't
pan out so well for him.


What did your clients do with your 
advice? What went wrong for those who 
came back unhappy? 

You’ll have to get to the bottom of this 
situation in the next chapter... 
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11 ert9t 



Err Well




That went just as 
I had planned. 






The world is messy. 

So it should be no surprise that your predictions rarely hit the target squarely. But if you 
offer a prediction with an error range, you and your clients will know not only the average 
predicted value, but also how far you expect typical outcomes to deviate from that prediction. Every
time you express error, you offer a much richer perspective on your predictions and 
beliefs. And with the tools in this chapter, you’ll also learn about how to get error under 
control, getting it as low as possible to increase confidence. 


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angry customers 


Your clients are pretty ticked off

In the previous chapter, you created a linear 
regression to predict what sort of raises 
people could expect depending on what they 
requested. 

Lots of customers are using the raise algorithm. 

Let’s see what they have to say. 



I got a 4.5% raise. It was a good 
raise. I think that's the sort of
raise I deserved. I was so nervous 
in the meeting that I can’t even 
remember what I asked for. 


I can’t believe it! I got a 5.0% 
bigger raise than the algorithm 
predicted! My negotiation must 
have scared my boss, and he just 
started throwing money at me! 







Yeah, I got no raise. Did you hear 
that? 0.0% I have some ideas for 
you about what you can do with 
your algorithm. 



I'm pretty pleased. My raise was
0.5% lower than expected, but
it's still a solid raise. I’m pretty
sure I wouldn’t have gotten it if I
hadn’t negotiated.


Bull's-eye! I got the
exact raise the algorithm
predicted. I'm telling you,
it's incredible. You must
be some sort of genius.
You rock my world.
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error 


What did your raise prediction algorithm do?



Everyone used the same formula, which was 
based on solid empirical data. 

But it looks like people had a bunch of 
different experiences. 




What happened? 


Sharpen your pencil


The statements on the facing page are qualitative data about the 
effectiveness of your regression. 


How would you categorize the statements? 
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diversity of outcomes 


Sharpen your pencil Solution


You looked closely at your customers' qualitative responses to
your raise prediction algorithm. What did you find? 



The statements.


Bull's-eye! I got the exact raise the algorithm 
predicted. I’m telling you, it's incredible. You must 
be some sort of genius. You rock my world. 

This one's spot on!

I’m pretty pleased. My raise was 0.3% 
lower than expected, but it’s still a solid 
raise. I’m pretty sure I wouldn't have 
gotten it if I hadn't negotiated. 

This one got a raise that was close
but not exactly what you predicted.

Yeah, I got no raise. Did you hear 
that? 0.0% I have some ideas for 
you about what you can do with 
your algorithm. 


I can't believe it! I got a 5.0%
bigger raise than the algorithm 
predicted! My negotiation must 
have scared my boss, and he just 
started throwing money at me! 



It looks like there are basically three types of response,
qualitatively speaking. One of them got exactly what
the algorithm predicted. Another received a raise that
was a little off, but still close to the prediction. Two
of them got raises that were way off. And the last
one, well, unless there's a trend of people who can't
remember what they requested, there's probably not
much you can make of it.

This one's just weird. It's kind
of hard to draw any conclusion
off of a statement like this.


These two appear
to be way off.





I got a 4.5% raise. It was a good raise. I 
think that's the sort of raise I deserved. 
I was so nervous in the meeting that I 
can't even remember what I asked for.



The segments of customers 

Remember, the regression equation predicts 
what people will hit on average. Obviously, 
not everyone is going to be exactly at the 
average. 

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error 



Let’s take a few more responses from your clients. The ones below are a little more specific than 
the previous ones. 

Draw arrows to point where each of these people would end up if you plotted their request/raise 
experiences on your scatterplot. 




Draw arrows to show

where these people would

show up on the scatterplot.
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visualize your new outcomes 


This person shows
up above the
regression line on
the far left of
the chart.






This person doesn't show up on
the scatterplot at all.





Exercise Solution


You just added new dots to your scatterplot to describe where three of your 
customers would be shown. What did you learn? 



I requested 8%, 
and I got 7%. 



This person would
show up right in
the middle of the
biggest clump of
observations.





The guy who asked for 25%
went outside the model

Using a regression equation to predict a 
value outside your range of data is called 
extrapolation. Beware extrapolation! 


The regression line
points to oblivion.





You don’t know what’s going on 
out here. Maybe if you had more data, 
you could use your equation to predict 
what a bigger request would get. 


But you’d definitely have to run your 
regression again on the new data to 
make sure you’re using the right line. 


Extrapolation is different from 
interpolation, where you are predicting 
points within your range of data, which 
is what regression is designed to do. 
Interpolation is fine, but you should be leery 
of extrapolation. 

People extrapolate all the time. But if 
you’re going to do it, you need to specify 
additional assumptions that make 
explicit your ignorance about what happens 
outside the data range. 
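One way to keep yourself honest is to check a request against the range of your data before predicting anything. The sketch below assumes a small hypothetical vector of past requests (the numbers are invented) and a helper called is_interpolation, which is a name made up for this example and not a built-in R function.

```r
# Hypothetical past raise requests, in percent.
requested <- c(1, 3, 5, 8, 10, 12, 15, 18, 22)

# The smallest and largest requests you have data for.
data_range <- range(requested)

# TRUE when x falls inside the data range (interpolation),
# FALSE when it falls outside (extrapolation).
is_interpolation <- function(x, rng) {
  x >= rng[1] & x <= rng[2]
}

is_interpolation(8, data_range)    # inside the data: interpolation
is_interpolation(30, data_range)   # outside the data: extrapolation
```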


Interpolation is just prediction
within these bounds.





What would you say to a client 
who is wondering what he 
should expect if he requested a 
30% raise? 
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people skills 


How to handle the client who wants
a prediction outside the data range 


You’ve basically got two options when your 
clients want a prediction outside your data 
range: say nothing at all, or introduce an 
assumption that you can use to find a prediction. 


Say nothing: 


No comment. If you 
ask for 25%, I have no 
idea what will happen. 


Which of these responses would be 
more beneficial to your client? The 
second one might satisfy your client’s 
desire to have a specific prediction, but 

a crummy prediction might be 
worse than no prediction. 



Use an assumption to 
make a prediction: 




The data won’t really tell us, but 
this has been a lucrative year, so a 
30% request is reasonable. I think 
it'll get you 20% or so. 




Here's an assumption
you use to

make the prediction.


You may or may not
have good reason to
believe this assumption.

There are no Dumb Questions


So what exactly might happen
outside the data range that’s such a
problem?

There might not even be data outside
the range you're using. And if there is, the
data could look totally different. It might even
be nonlinear.

I won’t necessarily have all the
points within my data range, though.

You’re right, and that’s a data quality
and sampling issue. If you don’t have all the
data points (that is, if you’re using a sample), you
want to make sure that the sample is
representative of the overall data set and is
therefore something you can build a model
around.

Isn’t there something to be said for 
thinking about what would happen under 
different hypothetical, purely speculative 
situations? 

Yes, and you should definitely do it. 

But it takes discipline to make sure your 
ideas about hypothetical worlds don’t spill 
over into your ideas (and actions) regarding 
the real world. People abuse extrapolation. 


Isn’t any sort of prediction about 
the future a type of extrapolation? 

Yes, but whether that’s a problem 
depends on what you’re studying. Is what 
you’re looking at the sort of thing that could 
totally change its behavior in the future, or 
is it something that is pretty stable? The 
physical laws of the universe probably aren’t 
going to change much next week, but the 
associations that apparently explain the 
stock market might. These considerations 
should help you know how to use your 
model. 
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error 



Watch it!


Always keep an eye on your model
assumptions.

And when you're looking at anyone else's
models, always think about how reasonable
their assumptions are and whether they might
have forgotten to mention any. Bad assumptions can make
your model completely useless at best and dangerously
deceptive at worst.



BE the Analyst

Look at this list of possible assumptions
for the Raise Reckoner. How might each of
these change your model, if it were true?


Economic performance has been about the same for all years in the 
data range, but this year we made a lot less money. 


One boss administered all the raises in the company for the data we 
have, but he’s left the company and been replaced by another boss. 


How you ask makes a big difference in what kind of raise you get. 


The spread of dots in the 20-50 percent range looks just like the 
spread of dots in the 10-20 percent range. 


Only tall people ask for raises. 
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think about assumptions 




BE the Analyst Solution

Look at this list of possible assumptions
for the Raise Reckoner. How might each of
these change your model, if it were true?

Economic performance has been about the same for all years in the
data range, but this year we made a lot less money.

This year's raises could be down, on average. The model might not work.

One boss administered all the raises in the company for the data we
have, but he’s left the company and been replaced by another boss.

The new guy might think differently and break the model.
Yikes! That'd be the end of your business, at least until you
have data on the new guy.

How you ask makes a big difference in what kind of raise you get.

This is surely true, and the data reflects the variation, so the model’s OK.
You don't have data on how to ask for money... the model
just says what you'll get on average at different requests.

The spread of dots in the 20-50 percent range looks just like the
spread of dots in the 10-20 percent range.

If this were true, you'd be able to extrapolate the regression equation.

Only tall people have asked for raises in the past.

If this were true, the model wouldn't apply to shorter people.
Shorter people might do better or worse
than taller people.



Now that you’ve thought through how 
your assumptions affect your model, 
you need to change your algorithm 

so that people know how to deal 
with extrapolation. 
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error 


Sharpen your pencil 


Write your caveat about
extrapolation here.


You need to tweak your algorithm to instruct your clients to avoid 
the trap of extrapolation. What would you add? 



The Raise Reckoner 


What will happen if we request a certain 
amount of money? Find out with this equation: 


y=2.3+0.7x 


Where x is the amount requested, and y 
is the amount we can expect to receive. 



Payoffs for negotiators 





How would you explain to your clients that they need 
to avoid extrapolation? 
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a new raise reckoner 


Sharpen your pencil Solution


How did you modify your compensation algorithm to ensure that 
your clients don’t extrapolate beyond the data range? 



The Raise Reckoner 


What will happen if we request a certain 
amount of money? Find out with this equation: 


y=2.3+0.7x 


Here's the language
you need to add.


Your regression equation will
work within this range.


Where x is the amount requested, and y 
is the amount we can expect to receive. 

But the formula only works if your requested
amount (x) is between 0% and 22%.


How would you change the algorithm to 
tell your clients to avoid extrapolation? 




Your data
only extends to here.


Beyond a 22% request, you
can't say what will happen.


Because you only have data for people who
request 22% or less in raises, your regression
only applies to requests between 0% and 22%.
Your clients can ask for more, and maybe they'll
make a lot of money if they do, but as far as
you're concerned, they'd be on their own.
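The caveat can be built straight into the arithmetic. Here is a minimal sketch of a raise_reckoner function, a name made up for this example, that applies y = 2.3 + 0.7x only to requests between 0% and 22%. Returning NA for anything outside that range is a design choice assumed here; it is R's way of saying "no prediction available."

```r
# Predict the raise received for a request, in percent.
# Requests outside the data range get NA: no prediction available.
raise_reckoner <- function(requested, lower = 0, upper = 22) {
  predicted <- 2.3 + 0.7 * requested
  outside <- requested < lower | requested > upper
  predicted[outside] <- NA
  predicted
}

raise_reckoner(8)           # inside the range, so you get a prediction
raise_reckoner(c(5, 30))    # the 30% request comes back as NA
```

Handing back NA keeps a client from mistaking an extrapolated number for a data-backed one.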
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error 


The guy who got fired because 
of extrapolation has cooled off 



With your new-and-improved regression 
formula, fewer clients will run with it into the 

land of statistical unknowns. 

So does that mean you’re finished? 
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more tweaks needed 


You've only solved 
part of the problem 

There are still lots of people who got screwy 
outcomes, even though they requested raise 
amounts that were inside your data range. 

What will you do about those folks? 


This guy got more than he asked
for by a pretty big margin.



Why do you think he got 10%?



She asked for a certain amount and got
just a little bit less than she expected.
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People >wh。-Pali below youv- 

regression predictions are

still pretty ticked off.


What could be
causing these
deviations from
your prediction?


What does the data for the 
screwy outcomes look like? 

Take another look at your visualization and 
regression line. Why don’t people just get what 
they ask for? 

How do you explain people who received
more than the model predicts?


Payoffs for negotiators 



You've accounted for
people who asked for
more than 20% in raises.


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what are the chances? 


Chance errors are deviations
from what your model predicts 


You’re always going to be making 
predictions of one sort or another, 
whether you do a full-blown regression or 
not. Those predictions are rarely going 
to be exactly correct, and the amount by 
which the outcomes deviate from your 
prediction is called chance error. 



Analysis 

What will the archer hit? 



Prediction 



Outcomes 


Time 


In statistics, chance errors are also 
called residuals, and the analysis of 
residuals is at the heart of good statistical 
modeling. 


Here’s one
residual.


While you might never have a good 
explanation for why individual residuals
deviate from the model, you should 
always look carefully at the residuals on 
scatterplots. 

If you interpret residuals correctly, you’ll 
better understand your data and the use 
of your model. 
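In R, a residual is simply the observed value minus the value the regression line predicts. The sketch below uses synthetic data (the requested and received values are invented) to show that the residuals R stores in a model match that subtraction.

```r
# Synthetic data, invented for illustration only.
set.seed(5)
requested <- runif(100, min = 0, max = 22)
received  <- 2.3 + 0.7 * requested + rnorm(100, sd = 2)

fit <- lm(received ~ requested)

# Residuals as R reports them...
res <- residuals(fit)

# ...and the same thing by hand: observed minus predicted.
by_hand <- received - fitted(fit)

all.equal(unname(res), unname(by_hand))

# The largest misses, in percentage points above and below the line.
range(res)
```

Some residuals are positive (people who beat the prediction) and some are negative (people who fell short); for a least-squares line with an intercept, they average out to zero.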


This observation is

about 8% higher than the
model predicts.



You’ll always have chance
errors in your predictions,
and you might never learn
why they’re in your data.
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error 


Sharpen your pencil

Better refine your algorithm some more: this time, you should
probably say something about error.

Here are some possible provisions to your algorithm about
chance error. Which one would you add to the algorithm?

"You probably won’t get what the model
predicts because of chance error.”

"Your results may vary by a margin of 20 percent
more or less than your predicted outcome.”

"Only actual results that fit the model results
are guaranteed."

"Please note that your own results may vary
from the prediction because of chance error."



The provision you
prefer will go here.



The raise reckoner 


What will happen if we request a certain amount 
of money? Find out with this equation: 



Where x is the amount requested, and y is 
the amount we can expect to receive. 

But the formula only works if your requested 
amount (x) is between 0% and 22%. 
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raise reckoner with error 


Sharpen your pencil 

Solution 


You refined the algorithm to incorporate chance errors. What 
does it say now? 


"You probably won’t get what the model 
predicts because of chance error.” 

This is true. Probably only a few people
will get exactly what the equation
returns. But it won’t be a very satisfying
explanation for the client.

“Only actual results that fit the model results
are guaranteed.”

This just sounds like nonsense.

Your results are only guaranteed if they
fit the model prediction? Well, what if
they don’t? That’s just silly.


"Your results may vary by a margin of 20 percent 
more or less than your predicted outcome.” 

It’s good to specify error quantitatively.
But what reason do you have to believe
the 20% figure? And if it’s true,
wouldn’t you want less error than that?

"Please note that your own results may vary 
from the prediction because of chance error.” 

True, but not terribly satisfying. Until we have
some more powerful tools, this statement
will have to do.



Here’s the caveat
about error.



The raise Reckoner 


What will happen if we request a certain amount 
of money? Find out with this equation: 



Where x is the amount requested, and y is 
the amount we can expect to receive. 

But the formula only works if your requested 
amount (x) is between 0% and 22%. 

Please note that your own results may vary
from the prediction because of chance error.










Does the presence of
error mean that the
analysis is erroneous?


You just lost all your clients. 

Hate to break it to ya, but your whole 
business has just fallen apart. That last 
line on your compensation algorithm 
was the difference between people 
feeling like you were helping them and 
people feeling like your product was 
worthless. 


How are you going to
fix your product?
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embrace your error 


Error is good for you 
and your client 


The more forthcoming you are about 
the chance error that your clients should 
expect in your predictions, the better off 
both of you will be. 

Your product 


r 


Your clients 




Here’s where your
clients have been
up until now.


Specifying error does not mean that 
your analysis is erroneous or wrong. It’s 
just being honest about the strength of 
your predictions. And the more your 
clients understand your predictions, the 
more likely they’ll be to make good 
decisions using them. 



Here’s where your
clients need to be.


Let’s specify error 
quantitatively... 
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error 



Chance Error

This week’s interview: 
What are the chances? 


Head First： Man, you’re a pain in the butt. 

Chance Error: Excuse me? 

Head First： It’s just that, because of you, regression 
will never really be able to make good predictions. 

Chance Error： What? I’m an indispensable 
part of regression in particular and any sort of 
measurement generally. 

Head First: Well, how can anyone trust a regression 
prediction as long as you’re a possibility? If our 
clients want to know how much money they’ll get 
when they request a raise, they don’t want to hear 
from us that it’s always possible (or even likely!) that 
what they get will be different from what the model 
predicts. 

Chance Error： You’ve got me all wrong. Think of 
me as someone who’s always there but who isn’t so 
scary if you just know how to talk about me. 

Head First: So “error” isn’t necessarily a bad word.

Chance Error： Not at all! There are so many 
contexts where error specification is useful. In fact, 
the world would be a better place if people did a 
better job expressing error often. 

Head First： OK, so here’s what I’m saying to 
clients right now. Say someone wants to know what 
they’ll get if they ask for 7 percent in a raise. I say, 
“The model predicts 7 percent, but chance error 
means that you probably will get something different 
from it.” 

Chance Error: How about you say it like this: if
you ask for 7 percent, you’ll probably get between 6
percent and 8 percent. Doesn’t that sound better?

Head First： That doesn’t sound so scary at all! Is it 
really that simple? 

Chance Error: Yes! Well, sort of. In fact, getting
control of error is a really big deal, and there’s a
huge range of statistical tools you can use to analyze
and describe error. But the most important thing
for you to know is that specifying a range for your
prediction is a heck of a lot more useful (and truthful)
than just specifying a single number.

Head First: Can I use error ranges to describe
subjective probabilities?

Chance Error： You can, and you really, really 
should. To take another example, which of these 
guys is the more thoughtful analyst: one who says he 
believes a stock price will go up 10 percent next year, 
or one who says he thinks it’ll go up between 0—20 
percent next year? 

Head First： That’s a no-brainer. The first guy can’t 
seriously mean he thinks a stock will go up exactly 10 
percent. The other guy is more reasonable. 

Chance Error: You got it. 

Head First： Say, where did you say you came from? 

Chance Error： OK, the news might not be so good 
here. A lot of times you’ll have no idea where chance 
error comes from, especially for a single observation. 

Head First： Seriously, you mean it’s impossible 
to explain why observations deviate from model 
predictions? 

Chance Error： Sometimes you can explain some 
of the deviation. For example, you might be able 
to group some data points together and reduce the 
chance error. But I’ll always be there on some level. 

Head First： So should it be my job to reduce you as 
much as possible? 

Chance Error： It should be your job to make your 
models and analyses have as much explanatory and 
predictive power as you can get. And that means 
accounting for me intelligently, not getting rid of me. 
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errors as numbers 


Specify error quantitatively 


It’s a happy coincidence if your observed
outcome is exactly what your predicted
outcome is, but the real question is what
is the spread of the chance error (the
residual distribution).

What you need is a statistic that shows
how far typical points or observations
are, on average, from your regression line.


The spread or distribution of
residuals around the
regression line says a
lot about your model.



The tighter your observations are
around your regression line, the
more powerful your line will be.


Definitely. The distribution of chance error, 
or R.M.S. error, around a regression line is 
a metric you can use just like the standard 
deviation around a mean. 

If you have the value of the R.M.S. error for 
your regression line, you’ll be able to use it to 
explain to your clients how far away from 
the prediction typical outcomes will be. 
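As a rough sketch of how that explanation might look in practice, the R snippet below turns a single prediction into a "typical range" by adding and subtracting the R.M.S. error. The toy `raises` data and the choice of a 7 percent request are assumptions made for illustration only.

```r
# Toy data, invented for illustration (not the book's employees file).
raises <- data.frame(
  requested = c(4, 5, 6, 7, 8, 9, 10, 12),
  received  = c(4.5, 5.0, 6.5, 6.0, 8.5, 8.0, 10.5, 9.0)
)
fit <- lm(received ~ requested, data = raises)

# R.M.S. error of the fitted line (R calls it the residual standard error).
rms_error <- summary(fit)$sigma

# Point prediction for someone who asks for a 7 percent raise.
prediction <- unname(predict(fit, newdata = data.frame(requested = 7)))

# Typical outcomes land roughly one R.M.S. error either side of the prediction.
typical_range <- c(low = prediction - rms_error, high = prediction + rms_error)

print(round(c(prediction = prediction, typical_range), 1))
```

Reporting the low and high values alongside the prediction is what lets a client see how much their own result might differ.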
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error 



Payoffs for negotiators 



The standard deviation
describes the spread
around the mean.


The R.M.S. error
describes the spread
from the regression line.


So how do you 
calculate the 
R.M.S. error? 


R.M.S. error
refers to
the relation
between two
variables.




Remember the units that you use for
standard deviation? They’re the same
as whatever’s being measured: if your
standard deviation of raises received is
5 percent, then typical observations will
be 5 percent away from the mean.

It’s the same deal with R.M.S. error. If, 
say, your R.M.S. error for predicting 
Received from Requested is 5 percent, 
then the typical observation will be 5 
percent away from whatever value the 
regression equation predicts. 



Quantify your residual distribution 
with Root Mean Squared error 




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r r s got your back 


Your model in R already
knows the R.M.S. error


The linear model object you created inside of R in the
last chapter doesn’t just know the y-axis intercept and
slope of your regression line.


It has a handle on all sorts of statistics pertaining to your 
model, including the R.M.S. error. If you don’t still have 
the myLm object you created in R, type in this function 
before the next exercise: 


Make sure you have the
most recent data loaded.



employees <- read.csv("http://www.headfirstlabs.com/books/hfda/hfda_ch10_employees.csv", header=TRUE)

myLm <- lm(received[negotiated==TRUE] ~
           requested[negotiated==TRUE], data=employees)


Behind 
the Scenes 




Under the hood, R is using this formula to
calculate the R.M.S. error:

$\sigma_y \times \sqrt{1 - r^2}$

Here $\sigma_y$ is the standard deviation of y, and $r$ is the correlation coefficient.
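If you want to convince yourself that the formula and the residuals tell the same story, the sketch below checks it on a small invented data set (the numbers are assumptions, not the book's data). One caveat worth knowing: the formula matches the root mean square of the residuals when the standard deviation divides by n, while the value R prints in `summary()` divides by n - 2 instead, so it comes out slightly larger on small samples.

```r
# Toy data, invented for illustration.
x <- c(4, 5, 6, 7, 8, 9, 10, 12)                 # requested, in percent
y <- c(4.5, 5.0, 6.5, 6.0, 8.5, 8.0, 10.5, 9.0)  # received, in percent
n <- length(y)

fit <- lm(y ~ x)
r   <- cor(x, y)

# Standard deviation of y, dividing by n (R's sd() divides by n - 1).
sd_y <- sqrt(mean((y - mean(y))^2))

# The formula from the text: sigma_y * sqrt(1 - r^2).
rms_formula <- sd_y * sqrt(1 - r^2)

# Root mean square of the residuals, computed directly.
rms_direct <- sqrt(mean(residuals(fit)^2))

# What summary() reports: same sum of squares, but divided by n - 2.
sigma_r <- summary(fit)$sigma

print(c(formula = rms_formula, direct = rms_direct, summary = sigma_r))
```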



there are no
Dumb Questions


Q: Do I need to memorize that formula?

A: As you’ll see in just a second, it’s pretty easy to calculate the
R.M.S. error inside of R or any other statistical software package.
What’s most important for you to know is that error can be described
and used quantitatively, and that you should always be able to
describe the error of your predictions.

Q: Do all types of regression use this same formula to
describe error?

A: If you get into nonlinear or multiple regression, you’ll use
different formulas to specify error. In fact, even within linear
regression there are more ways of describing variation than R.M.S.
error. There are all sorts of statistical tools available to measure error,
depending on what you need to know specifically.
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Next, color in an error band 
across your entire regression 
line to represent your R.M.S. 
error. 

The error band should follow 
along the regression line 
and the thickness above and 
below the line should be 
equal to one R.M.S. error. 


Start your error
band here.





Test Drive


Instead of filling in the algebraic equation to get the R.M.S. error, 
let’s have R do it for us. 


Take a look at R’s summary of your model by entering this command: 

summary(myLm) 


Your R.M.S. error will be in the output, but you can also type this to see the error: 


summary(myLm)$sigma 
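If you don't have `myLm` handy, the same two steps can be tried on any linear model. The sketch below uses a small invented data set (an assumption for illustration) and shows that the "Residual standard error" line in the printed summary and the `$sigma` value are the same quantity.

```r
# Toy data, invented for illustration (not the book's employees file).
raises <- data.frame(
  requested = c(4, 5, 6, 7, 8, 9, 10, 12),
  received  = c(4.5, 5.0, 6.5, 6.0, 8.5, 8.0, 10.5, 9.0)
)
fit <- lm(received ~ requested, data = raises)

# The printed summary contains a "Residual standard error" line.
printed  <- capture.output(print(summary(fit)))
has_line <- any(grepl("Residual standard error", printed, fixed = TRUE))

# The same number, pulled out directly in two equivalent ways.
from_summary <- summary(fit)$sigma
from_sigma   <- sigma(fit)

# And computed by hand from the residuals and their degrees of freedom.
by_hand <- sqrt(sum(residuals(fit)^2) / df.residual(fit))

print(has_line)
print(c(from_summary, from_sigma, by_hand))
```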




The R.M.S. error is also
called sigma or residual
standard error.
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query your linear model object 


R's summary of your linear 
model shows your R.M.S. error 


When you ask R to summarize your 
linear model object, it gives you a 
bunch of information about what’s 
inside the object. 


A summary
of your model.


R has all sorts of things to tell
you about your linear model.


Not only do you see your regression 
coefficients, like you saw in the 
previous chapter, but you also see the 
R.M.S. error and a bunch of other 
statistics to describe the model. 


And here’s your R.M.S. error.


If you draw a band that’s about 2.3
above and below your regression line,
you get a spread that looks like this.
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error 


< ^Sharpen your pencil 


You’re ready to have another go at your compensation algorithm. 
Can you incorporate a more nuanced conception of chance error? 


How would you change this algorithm to incorporate your R.M.S. error? Write 
your answer inside the Raise Reckoner. 



Add your answer to the
Raise Reckoner.


The Raise Reckoner 

_ 

What will happen if we request a certain 
amount of money? Find out with this equation: 



Where x is the amount requested, and y
is the amount we can expect to receive.

But the formula only works if your requested 
amount (x) is between 0% and 22%. 

Please note that your own results may vary
from the prediction because of chance error.


Delete this language.



Use R.M.S. error to
improve your algorithm.


[Excerpt from the summary(myLm) output]

Residual standard error: 2.298
Multiple R-squared: 0.4431
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some clients dislike uncertainty 

^Sharpen your pencil 

Solution 



Let’s take a look at your new algorithm, complete with R.M.S. 
error for your regression. 


The raise Reckoner 


What will happen if we request a certain 
amount of money? Find out with this equation: 


Here’s your new algorithm, which
incorporates R.M.S. error.



This statement tells your clients
the range they should expect
their own raise to be inside of.



Where x is the amount requested, and y
is the amount we can expect to receive.

But the formula only works if your requested 
amount (x) is between 0% and 22%. 


Most but not all raises will be within a range of
2.5% more or less than the prediction.




So if I ask for 7%, I’ll get 4.5% to 9.5%
back? I just need more than that if
you want me to take you seriously. Can
you give me a prediction with a lower
amount of error, please?


She has a point. 

Is there anything you can do to make this
regression more useful? Can you look at
your data in a way that reduces the error?



Look at different strips on your scatterplot. Is the R.M.S. error different at the various strips 
along the regression line? 



One strip here is done for you.


Do you see segments where the residuals are fundamentally different? 




Exercise


For each strip on the scatterplot, color in what you think 
the error is within that strip. 
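If you would rather compute than color, here is a hedged sketch of the same exercise in R: cut the x-axis into strips and take the R.M.S. of the residuals inside each one. The data, the strip boundaries, and the labels are all invented for illustration.

```r
# Toy data, invented for illustration: small requests sit close to the line,
# large requests are scattered widely.
raises <- data.frame(
  requested = c(5, 6, 7, 8, 9, 10, 12, 14, 16, 18, 20),
  received  = c(5.0, 6.2, 6.8, 8.1, 9.0, 9.9, 18, 6, 20, 7, 22)
)
fit <- lm(received ~ requested, data = raises)

# Two strips along the x-axis: (0, 10] and (10, 22].
strip <- cut(raises$requested, breaks = c(0, 10, 22),
             labels = c("0-10", "10-22"))

# R.M.S. of the residuals that fall inside each strip.
rms_by_strip <- tapply(residuals(fit), strip,
                       function(e) sqrt(mean(e^2)))

print(round(rms_by_strip, 2))
```

With this toy data the right-hand strip has a much larger R.M.S. error than the left-hand one, which is the pattern the exercise asks you to look for.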
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error varies across your graph 



The error is lowest here.



The error is a lot higher
over here.


Why is the error higher on 
the right side? 

Look at the data and think about 
what it must mean. 




You’ve looked at the R.M.S. error for each strip. What did you find? 


Exercise
Solution


Payoffs for negotiators
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error 


Jim: Oh man, that’s nuts! It looks like there’s a different 
spread of predictions for every strip along the scatterplot! 

Joe: Yeah, that’s crazy. Seriously. How in the world do we 
explain that to our customers? 

Jim: They’ll never buy it. If we say to them, your error is 
looking relatively low at 7—8 percent, but at 10—11 percent 
the error is through the roof, they just won’t get it. 

Frank: Hey, relax you guys. Maybe we should ask why 
the error bands look the way they do. It might help us 
understand what’s happening with all these raises. 

Jim: [Scoff] There you go being all circumspect again.

Frank: Well, we’re analysts, right? 

Joe: Fine. Let’s look at what people are asking for. At the 
start of the scale, there’s kind of a big spread that narrows 
as soon as we hit 5 percent or so. 

Jim: Yeah, and there are only 3 people who asked for less 
than 5 percent, so maybe we shouldn’t put too much stock 
in that error from 4—5 percent. 

Frank: Excellent! So now we’re looking at the range from 
5 percent all the way up to about 10 percent. The error is 
lowest there. 

Joe: Well, people are being conservative about what they’re 
asking for. And their bosses are reacting, well, conservatively. 

Frank: But then you get over 10 percent … 

Jim: And who knows what’ll happen to you. Think about 
it. 15 percent is a big raise. I wouldn’t have the guts to 
request that. Who knows what my boss would do? 

Frank: Interesting hypothesis. Your boss might reward you 
for being so bold, or she might kick your butt for being so 
audacious. 

Jim: Once you start asking for a lot of money, anything can 
happen. 

Joe: You know, guys, I think we’ve got two different groups 
of people in this data. In fact, I think we may even have 
two different models. 



What would your 
analysis look like if 
you split your data? 
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what segmentation really is 


This error estimate is too low.


Payoffs for negotiators 



When we looked at the strips, we saw that the 
error in the two regions is quite different. In 
fact, segmenting the data into two groups, 
giving each a model, would provide a more 
realistic explanation of what’s going on. 

Segmenting your data into two groups will 
help you manage error by providing more 
sensible statistics to describe what happens in 
each region. 
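Here is a minimal sketch of that comparison in R, using invented data and the `subset` argument of `lm()` to fit one model per region. The 10 percent split point follows the discussion above; everything else is an assumption for illustration.

```r
# Toy data, invented for illustration: tight on the left, scattered on the right.
raises <- data.frame(
  requested = c(5, 6, 7, 8, 9, 10, 12, 14, 16, 18, 20),
  received  = c(5.0, 6.2, 6.8, 8.1, 9.0, 9.9, 18, 6, 20, 7, 22)
)

# One model for everyone, then one model per segment.
fit_all   <- lm(received ~ requested, data = raises)
fit_small <- lm(received ~ requested, data = raises, subset = requested <= 10)
fit_big   <- lm(received ~ requested, data = raises, subset = requested > 10)

# R.M.S. error reported by each model.
rms <- c(all   = summary(fit_all)$sigma,
         small = summary(fit_small)$sigma,
         big   = summary(fit_big)$sigma)

print(round(rms, 2))
```

In this toy example the single model's error sits between the two segment errors: it overstates the uncertainty for small requests and understates it for large ones.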


These error estimates
are more realistic.



Segmentation is all about managing error 

Splitting data into groups is called
segmentation, and you do it when having
multiple predictive models for subgroups will
result in less error overall than one model.

On a single model, the error estimate for
people who ask for 10 percent or less is too
high, and the error estimate for people who
ask for more than 10 percent is too low.
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error 


Sharpen your pencil 


If you segment your data between people who requested less than 
10 percent and people who requested more than 10 percent, chances 
are, your regression lines will look different. 


Here’s the split data. Draw what you think the regression lines are for these two sets of data. 



Hint: the dots are spread
out on the right side.
That’s OK, just do your
best to estimate
where the line goes.


Remember: the regression line is the line
that best fits the graph of averages.
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two regression lines 


^Sharpen your pencil 

Solution 


You've created two regression lines — two separate models! 
What do they look like? 


This line through the people who make
low requests should fit the data much
better than the original model.


Payoffs for negotiators 




Here’s your original model.


The regression line through the
more aggressive negotiators
should have a different
slope from the other line.





Two regression lines, 
huh? Why not twenty? 

I could draw a separate 
regression line for each strip... 
how would you like that?!? 


This is a good one. Why stop at two 
regression lines? Would having more 
lines — a lot more, say — make your model 
more useful? 










balance explanation and prediction 


Good regressions balance
explanation and prediction


Two segments in your raise regression will let you 
fit the data without going to the extreme of too 
much explanation or too much prediction. As a 
result, your model will be useful. 



The model fits every
data point.


The model fits many
possible configurations of
data points.


Your analysis should be 
somewhere in the middle. 


You’ve mastered the data, but
you can’t predict anything.


Your prediction will be accurate, but
it’s not precise enough to be useful.


there are no
Dumb Questions


Q: Why would I stop at splitting the
data into 2 groups? Why not split them
into 5 groups?

A: If you’ve got a good reason to do it,
then go right ahead.

Q: I could go nuts and split the
data into 3,000 groups. That’s as many
“segments” as there are data points.

A: You certainly could. And if you did, how
powerful do you think your 3,000 regressions
would be at predicting people’s raises?

Q: Ummm...

A: If you did that, you’d be able to explain
everything. All your data points would be
accounted for, and the R.M.S. error of your
regression equations would all be zero. But
your models would have lost all ability to
predict anything.

Q: So what would an analysis look like
that had a whole lot of predictive power
but not a lot of explanatory power?

A: It’d look something like your first
model. Say your model was this: “No matter
what you ask for, you’ll receive somewhere
between -1,000 percent and 1,000 percent
in raises.”

Q: That just sounds dumb.

A: Sure, but it’s a model that has
incredible predictive power. The chances
are that no one you ever meet will be
outside that range. But the model doesn’t
explain anything. With a model like that, you
sacrifice explanatory power to get predictive
power.

Q: So that’s what zero error looks like:
no ability to predict anything.

A: That’s it! Your analysis should be
somewhere between having complete
explanatory power and complete predictive
power. And where you fall between those
two extremes has to do with your best
judgment as an analyst. What sort of model
does your client need?
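The "one segment per data point" extreme from the questions above is easy to try on a small scale. In this sketch (toy data, invented for illustration), giving every observation its own group drives the residuals to zero, but it also leaves no data over to estimate error with, and the model has nothing to say about anyone who is not already in the data.

```r
# Toy data, invented for illustration.
raises <- data.frame(
  requested = c(4, 5, 6, 7, 8, 9, 10, 12),
  received  = c(4.5, 5.0, 6.5, 6.0, 8.5, 8.0, 10.5, 9.0)
)

# An ordinary straight-line model leaves some chance error behind.
fit_line <- lm(received ~ requested, data = raises)

# The extreme case: every observation gets its own "segment".
raises$id <- factor(seq_len(nrow(raises)))
fit_every_point <- lm(received ~ id, data = raises)

# Every point is matched exactly, so the residuals are (numerically) zero...
max_abs_residual <- max(abs(residuals(fit_every_point)))

# ...and there are no degrees of freedom left over.
leftover_df <- df.residual(fit_every_point)

print(c(line_rms = sigma(fit_line), max_abs_residual = max_abs_residual,
        leftover_df = leftover_df))
```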
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error 




Sharpen your pencil 


For each of these two models, color in bands that represent 
R.M.S. error. 


Draw bands to describe
the distribution of
residuals for each model.
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error under control 


Aggressive
negotiators



Let’s implement these models in R. 


Your new model for timid 
negotiators does a better job 
fitting the data. 

The slope of the regression line 
is more on target, and the R.M.S. 
error is lower. 



Timid negotiators


Your new model for aggressive 
negotiators is a better fit, too. 

The slope is more on target, and the 
R.M.S. error is higher, which more 
accurately represents what people 
experience when they ask for more 
than 10 percent. 
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They’re more powerful 
because they do a better job 
of describing what’s actually 
happening when people ask for 


raises. 






Your segmented models manage
error better than the original model






It’s time to implement those new models and segments in R. Once you have the models 
created, you’ll be able to use the coefficients to refine your raise prediction algorithm. 


Create new linear model objects that correspond to your two segments by typing the following 
at the command line: 


This code tells R to look only at
data in your database for negotiators...

myLmBig <- lm(received[negotiated==TRUE & requested > 10] ~
              requested[negotiated==TRUE & requested > 10],
              data=employees)

myLmSmall <- lm(received[negotiated==TRUE & requested <= 10] ~
                requested[negotiated==TRUE & requested <= 10],
                data=employees)

...and to split the segments at
the 10% raise.




Look at the summaries of both linear model objects using these versions of the summary()
function. Annotate these commands to show what each one does:


summary(myLmSmall)$coefficients 
summary(myLmSmall)$sigma 
summary(myLmBig)$coefficients 
summary(myLmBig)$sigma 

These results will make your
algorithm much more powerful.
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revised r.m.s. error values 



You just ran two new regressions on segmented data. What did you find? 



When you tell R to
create the new models,
R doesn’t display
anything in the console.


But quite a lot
happens behind
the scenes!

[Screenshot: R console output from the commands above. The legible values include an intercept of 0.7933468 and an R.M.S. error of about 1.37 for myLmSmall, and an R.M.S. error of 4.544424 for myLmBig.]


Here are the
R.M.S. errors for
your new models.


Here are the slopes
and intercepts
for your new
regression lines.
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error 


Sharpen your pencil 


You now have everything you need to create a much more 
powerful algorithm that will help your customers understand 
what to expect no matter what level of raise they request. Time 
to toss out the old algorithm and incorporate everything you’ve 
learned into the new one. 


Using the slopes and intercepts of your new models, write the 
equations to describe both of them. 


For what levels of raises does each model apply? 


How close to the prediction should your client expect her own raise to be, 
depending on which model she uses? 



The raise Reckoner 


What will happen if we request a 
certain amount of money? 


Your answers will be
your new algorithm.


Don’t forget about
avoiding extrapolation!


Think about
R.M.S. error.
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algorithmic denouement 


Sharpen your pencil 

Solution 


What is your final compensation algorithm? 



Here’s the model
for small requests.


Here’s the model
for big requests.


The Raise Reckoner


What will happen if we request a certain amount
of money? Say x is the amount requested, and
y is the amount we can expect to receive.

If you ask for less than 10%, use this equation:

y = 0.8 + 0.9x

Your raise will be plus or minus
1.4% of the predicted value.

If you ask for 10% or more, use this equation:


This warns the client not to
extrapolate.


You used the coefficients
to fill in
the regression equations.


Here’s
R.M.S. error.


Here’s
R.M.S. error.
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error 


Your clients are returning in droves

Your new algorithm is really starting to pay off, 
and everyone’s excited about it. 





Now people can decide whether they want to 
take the riskier strategy of asking for a lot of 
money or just would rather play it safe and ask 
for less. 

The people who want to play it safe are getting 
what they want, and the risk-takers understand 
what they’re getting into when they ask for a 
lot. 
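For readers who like to see an algorithm as code, here is one way the final Raise Reckoner could be sketched in R. The small-request numbers (0.8, 0.9, and 1.4) come from the algorithm above. The intercept and slope for large requests are not legible in this text, so the values below are placeholders and should be replaced with the output of `summary(myLmBig)$coefficients`; the 22 percent upper limit is carried over from the earlier version of the Reckoner as an assumption.

```r
# Sketch of the two-segment Raise Reckoner. Each model is a list holding an
# intercept, a slope, and an R.M.S. error, all in percent.
raise_reckoner <- function(requested, small, big, cutoff = 10, max_request = 22) {
  # Avoid extrapolation: refuse requests outside the range the data covers.
  if (requested < 0 || requested > max_request) {
    stop("Requested raise is outside the range the models were built on.")
  }
  # Pick the segment: below the cutoff uses the small-request model.
  model <- if (requested < cutoff) small else big
  predicted <- model$intercept + model$slope * requested
  c(low = predicted - model$rms, predicted = predicted, high = predicted + model$rms)
}

# Small requests: values taken from the algorithm in the text.
small_model <- list(intercept = 0.8, slope = 0.9, rms = 1.4)

# Large requests: intercept and slope are PLACEHOLDERS (assumption);
# the R.M.S. error is the 4.544424 from the console output, rounded.
big_model <- list(intercept = 7.8, slope = 0.3, rms = 4.5)

result <- raise_reckoner(7, small_model, big_model)
print(result)
```

Returning a range rather than a single number keeps the error statement attached to every prediction, so a client can compare the narrow band for a modest request with the wide band for an aggressive one.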
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12 relcttlonal databases 



How do you structure really, really multivariate data? 

A spreadsheet has only two dimensions: rows and columns. And if you have a bunch of 
dimensions of data, the tabular format gets old really quickly. In this chapter, you’re about 
to see firsthand where spreadsheets make it really hard to manage multivariate data and 
learn how relational database management systems make it easy to store and retrieve 
countless permutations of multivariate data. 


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magazine performance analysis 


The Dataville Dispatch wants to analyze sales

The Dataville Dispatch is a popular news magazine, read by most 
of Dataville’s residents. And the Dispatch has a very specific 
question for you: they want to tie the number of articles per 
issue to sales of their magazine and find an optimum number 
of articles to write. 

They want each issue to be as cost effective as possible. If putting
a hundred articles in each issue doesn’t get them any more sales
than putting fifty articles in each issue, they don’t want to do it.

On the other hand, if fifty-article issues correlate to more sales
than ten-article issues, they’ll want to go with the fifty articles.


[Magazine covers: the Dataville Dispatch. Headlines include “The Perfect Equation? HF Stats FTW: Innovative book helps thousands grapple with numbers,” “PMP: the new craze,” “Software Development: How a whiteboard could save your next project,” and “The Secret Life of the World’s Databases: Newly published tables reveal surprising new connections.” www.headfirstlabs.com]


They’ll give you free advertising 

for your analytics business for a year 
if you can give them a thorough 
analysis of these variables. 
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relational databases 


Here's the data they keep to track their operations 


Looks like they keep
track of a lot of stuff.





What do you need to know in order 
to compare articles to sales? 



The Dispatch has sent you the data they use to manage 
their operations as four separate spreadsheet files. The 
files all relate to each other in some way, and in order 
to analyze them, you’ll need to figure out how. 


How do these
data tables relate
to each other?
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how does the data relate? 


You need to know how the data 
tables relate to each other 


The table or tables you create to get the answers that the 
Dispatch wants will tie article count to sales. 

So you need to know how all these tables relate to each other. 
What specific data fields tie them together? And beyond that, 
what is the meaning of the relationships? 


Here is what the Dispatch
has to say about how
they manage their data.


From: Dataville Dispatch 
To: Head First 
Subject: About our data 

Each issue of the magazine has a
bunch of articles, and each article has an
author. So in our data we tie the authors
to the articles. When we have an issue
ready, we call our list of wholesalers.
They place orders for each issue, which
we record in our sales table. The “lot size”
in the table you’re looking at counts the
number of copies of that issue that we
sell, usually in denominations of 100, but
sometimes we sell less. Does that help?
- DD


They have a lot of stuff to record, which
is why they need all these spreadsheets.
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relational databases 


Sharpen your pencil


Draw arrows and use words to describe the relationship 
between the things being recorded in each spreadsheet. 


Draw arrows between the tables and
describe how each relates to the other.



[Screenshots: the Dispatch’s four spreadsheets]
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relationships identified 


Sharpen your pencil

Solution 


What relationships did you discover among the spreadsheets 
that the Dataville Dispatch keeps? 


Each sale refers to a bundle of copies
(usually around 100) of one issue.



Each author
writes a bunch
of articles.
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relational databases 


A database is a collection of data with 
well-specified relations to each other 


A database is a table or collection of tables 
that manage data in a way that makes these 
relationships explicit to each other. Database 
software manages these tables, and you have a lot 
of different choices of database software. 




For organizations that collect the same
type of data, out-of-the-box databases
specifically manage that sort of data.






Database 


Custom-made implementation


Database software 




What’s really important is that you know the
relationships, within the software, among the
data you want to record.


Other times, people need something really
specific to their needs, and they’ll make
their own database with Oracle, MySQL,
or something else under the hood.



Database 


Here’s the big question.


So how do you use this 
knowledge to calculate 
article count and sales 


total for each issue? 
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follow the bread crumbs 


Trace a path through the relations 
to make the comparison you need 


When you have a bunch of 
tables that are separate but 
linked through their data, 
and you have a question you 
want to answer that involves 
multiple tables, you need to 
trace the paths among the 
tables that are relevant. 



These are the tables you need to pull together.


This spreadsheet will let you compare article count and sales.



Create a spreadsheet that 
goes across that path 

Once you know which tables you need, then 
you can come up with a plan to tie the data 
together with formulas. 

Here, you need a table that compares article 
count and sales for each issue. You’ll need to 
write formulas to calculate those values. 



Next, you'll calculate these values.


Issue    Article count    Sales Total
1        5                1250
2        7                1800
3        8                1500
4        6                1000


You'll need formulas for these.
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relational databases 



Let’s create a spreadsheet like the one on the facing page and start by
calculating the “Article count” for each issue of the Dispatch.


1. Open the hfda_ch12_issues.csv file and save a copy for your work. Remember, you don’t want to mess
up an original file! Call your new file “dispatch analysis.xls”.

Save this file under a new name, so you don't destroy the original data.


Load these!

www.headfirstlabs.com/books/hfda/hfda_ch12_issues.csv

www.headfirstlabs.com/books/hfda/hfda_ch12_articles.csv


2. Open hfda_ch12_articles.csv and right-click on
the tab that lists the file name at the bottom of the
sheet. Tell your spreadsheet to move the file to your
dispatch analysis.xls document.

Copy your articles sheet to your new document.


3. Create a column for Article count on your issue sheet. Write a COUNTIF formula to count the
number of articles for that issue, and copy and paste that formula for each issue.

Put your formula here.
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article countin 1 



What sort of article count did you find each issue to have? 


Exercise
Solution


1. Open the hfda_ch12_issues.csv file and save a copy for your work. Remember, you don’t want to mess
up an original file! Call your new file “dispatch analysis.xls”.

2. Open hfda_ch12_articles.csv and right-click on the tab that lists the file name at the bottom of the
sheet. Tell your spreadsheet to move the file to your dispatch analysis.xls document.

3. Create a column for Article count on your issue sheet. Write a COUNTIF formula to count the
number of articles for that issue, and copy and paste that formula for each issue.

The formula looks at the “articles” tab in your spreadsheet.


=COUNTIF(hfda_ch12_articles.csv!B:B,hfda_ch12_issues.csv!A2)


It counts the number of times that issue shows up in the list of articles.
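
For readers who think more easily in code than in cell references, here is a minimal Python sketch of what a COUNTIF across two sheets does. The rows and the helper name count_if are made up for illustration; they are not the Dispatch's real data or any spreadsheet API.

```python
# Hypothetical rows from the articles tab: (articleID, issueID).
articles = [(1, 1), (2, 1), (3, 2), (4, 2), (5, 2), (6, 3)]

# Hypothetical issueIDs from the issues tab.
issue_ids = [1, 2, 3, 4]


def count_if(values, criterion):
    """Count how many entries in values equal criterion, like COUNTIF."""
    return sum(1 for v in values if v == criterion)


# Column B of the articles tab in the formula above: the issue each article ran in.
issue_column = [issue_id for _, issue_id in articles]

# One count per issue, the same as pasting the formula down the column.
article_counts = {i: count_if(issue_column, i) for i in issue_ids}
print(article_counts)
```

An issue with no matching articles simply gets a count of zero, which is also what the spreadsheet formula returns.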







[Figure: the issues tab, with columns issueID, PubDate, and Article count filled in for each issue.]


This is a tab in your dispatch analysis spreadsheet.
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relational databases 



Add a field for sales totals to the spreadsheet you are creating.

Load this!


Sounds good... let’s add
sales to this list!


Cool! When you add the sales figures to your spreadsheet, 
keep in mind that the numbers just refer to units of the 
magazine, not dollars. I really just need you to measure sales 
in terms of the number of magazines sold, not in dollar terms. 


Here's the Dispatch's editor.


www.headfirstlabs.com/books/hfda/hfda_ch12_sales.csv


1. Copy hfda_ch12_sales.csv into a
new tab in your dispatch analysis.xls.
Create a new column for Sales on
the same sheet you used to count the
articles.


[Figure: the issues sheet with columns issueID, PubDate, Article count, and a new Sales column.]


Add this column and put your formulas here.


2. Use the SUMIF formula to tally the sales figures for issueID #1, putting the formula in cell C2.
Copy that formula and then paste it for each of the other issues.
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incorporate sales 



What formula did you use to add sales to your spreadsheet? 


This formula shows that the first issue sold 2,227 units.

The first argument of the SUMIF formula looks at the issues.

=SUMIF(hfda_ch12_sales.csv!B:B,hfda_ch12_issues.csv!A2,hfda_ch12_sales.csv!C:C)


[Figure: the issues tab, now with columns issueID, PubDate, Article count, and Sales; the first issue, published 10/24/04, has 7 articles and 2,227 units sold.]


The second argument looks at the specific
issue whose sales you want to count.


The third argument looks at the actual sales figures you want to sum.
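
The three SUMIF arguments map onto a small function. This is only a sketch with invented sales rows; the helper name sum_if and the column layout (issueID in one column, lot size in another) are assumptions made for illustration.

```python
# Hypothetical rows from the sales tab: (saleID, issueID, lot size in copies).
sales = [(1, 1, 100), (2, 1, 120), (3, 2, 90), (4, 3, 100), (5, 3, 80)]


def sum_if(criteria_range, criterion, sum_range):
    """Add up the sum_range entries whose paired criteria_range entry equals criterion."""
    return sum(s for c, s in zip(criteria_range, sum_range) if c == criterion)


issue_column = [issue_id for _, issue_id, _ in sales]  # first argument: where to look
lot_column = [lot for _, _, lot in sales]              # third argument: what to add up

# The second argument is the issue you are tallying; repeat for each issue.
sales_totals = {i: sum_if(issue_column, i, lot_column) for i in (1, 2, 3)}
print(sales_totals)
```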
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relational databases 


Your summary ties article
count and sales together

This is exactly the spreadsheet you need to tell you whether
there is a relationship between the number of articles that the
Dataville Dispatch publishes every issue and their sales.

This seems nice. But it'd be a
little easier to understand if it
were made into a scatterplot. Have
you ever heard of scatterplots?


Definitely! Let’s
let him have it…


Sharpen your pencil


Open R and type the getwd() command to figure out
where R keeps its data files. Then, in your spreadsheet, go to
File > Save As... and save your data as a CSV into that directory.

This function tells you R's working
directory, where it looks for files.

Name your file dispatch analysis.csv.

Execute this command to load your data into R:

dispatch <- read.csv("dispatch analysis.csv", header=TRUE)

Once you have your data loaded, execute this function. Do you
see an optimal value?

plot(Sales ~ jitter(Article.count), data=dispatch)

You'll see how jitter works in a second...

engage the optimum


Sharpen your pencil
Solution


Did you find an optimal value in the data you loaded?


The optimum appears to be around 10 articles.


Use this command to load your CSV into R.

The head command shows you what you have just loaded... it's always good to check.


> dispatch <- read.csv("dispatch analysis.csv", header=TRUE)
> head(dispatch)
  issueID PubDate Article.count Sales
[Console output: the first six rows of the data frame, for the issues published 10/24/04 through 1/7/05.]
> plot(Sales ~ jitter(Article.count), data=dispatch)


Here's the command that displays your scatterplot.


The jitter command adds a little bit of noise to your numbers, which
separates them a little and makes them easier to see on the scatterplot.

Try running the same command without adding jitter; isn't the result hard to read?


Make sure that the field names in your plot formula match the names
that head shows you are contained in the data frame.


[Figure: scatterplot of Sales against jitter(Article.count).]


It looks like a pretty steady increase in sales as more articles are added...

...until there are about 10 articles, at which point more articles
don't associate with increased sales.

When there were five articles in the Dispatch,
only 1,000 or fewer copies of the issue sold.
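
To see why jitter helps, here is a rough Python sketch of the idea: add a little random noise to each value so that issues with the same article count no longer sit on top of each other. The function below, its amount of 0.2, and its seed parameter are assumptions for illustration; R's own jitter picks its default amount differently.

```python
import random


def jitter(values, amount=0.2, seed=None):
    """Return a copy of values with uniform noise in [-amount, amount] added to each."""
    rng = random.Random(seed)
    return [v + rng.uniform(-amount, amount) for v in values]


# Several issues share an article count of 7, so their points would overlap.
article_counts = [7, 5, 7, 7, 8, 7]
jittered = jitter(article_counts, amount=0.2, seed=1)

for original, moved in zip(article_counts, jittered):
    print(original, round(moved, 3))
```

The noise only changes where points are drawn; the analysis should still be read against the original counts.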
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happy clients 


Looks like your scatterplot
is going over really well


Sounds like more work for you... awesome!


Subject: Thank you

Thank you! This is a big [...] for us. I'd [...] suspected a relationship
like this was the case, and your analysis demonstrated it [...] a free year of
[...] word about your [...]


Downright effusive!


there are no
Dumb Questions


Q: Do people actually store data in linked spreadsheets like that?

A: Definitely. Sometimes you’ll receive extracts from larger
databases, and sometimes you’ll get data that people have manually
kept linked together like that.

Q: Basically, as long as there are those codes that the
formulas can read, linking everything with spreadsheets is
tedious but not impossible.

A: Well, you’re not always so lucky to receive data from multiple
tables that have neat little codes linking them together. Often, the
data comes to you in a messy state, and in order to make the
spreadsheets work together with formulas, you need to do some
clean-up work on the data. You'll learn more about how to do that in
the next chapter.

Q: Is there some better software mechanism for tying data
from different tables together?

A: You’d think so, right?
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Copying and pasting all that data was a pain 


It would suck to go through that process every 
time someone wanted to query (that is, to ask a 
question of) their data. 

Besides, aren’t computers supposed to be able to 
do all that grunt work for you? 


Wouldn’t it be dreamy if there
were a way to maintain data relations in
a way that’d make it easier to ask the
database questions? But I know it’s
just a fantasy...
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meet rdbms 


Relational databases manage relations for you 


One of the most important and powerful ways of managing 
data is the RDBMS or relational database management 
system. Relational databases are a huge topic, and the more 
you understand them, the more use you’ll be able to squeeze 
out of any data you have stored in them. 




...build one of these.


[Figure: a relational database drawn as RDBMS tables, each holding rows of data, with arrows linking the tables.]

This diagram shows the relationships and
data tables inside a relational database.

This field is a key.



What is important for you to know is that the relations that 
the database enforces among tables are quantitative. The 
database doesn’t care what an “issue” or an “author” is; it just 
knows that one issue has multiple authors. 


Keys are values that identify records uniquely.


Each row of an RDBMS table has a unique key, which you’ll often
see called an ID, and the database uses the keys to make sure that these
quantitative relationships are never violated. Once you have a
RDBMS, watch out: well-formed relational data is a treasure
trove for data analysts.
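
A small sketch can make "never violated" concrete. It uses SQLite through Python's built-in sqlite3 module purely as a stand-in for whatever RDBMS you have; the two tables and their columns are simplified assumptions, not the Dispatch's actual database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked
conn.execute("CREATE TABLE Issues (IssueID INTEGER PRIMARY KEY)")
conn.execute(
    "CREATE TABLE Articles ("
    " ArticleID INTEGER PRIMARY KEY,"
    " IssueID INTEGER NOT NULL REFERENCES Issues(IssueID))"
)
conn.execute("INSERT INTO Issues (IssueID) VALUES (1)")
conn.execute("INSERT INTO Articles (ArticleID, IssueID) VALUES (1, 1)")


def try_insert(sql):
    """Run an INSERT and report whether the database accepted it."""
    try:
        conn.execute(sql)
        return "accepted"
    except sqlite3.IntegrityError:
        return "rejected"


# A second article with ArticleID 1 would break key uniqueness.
duplicate_key = try_insert("INSERT INTO Articles (ArticleID, IssueID) VALUES (1, 1)")
# An article pointing at issue 99 would refer to an issue that does not exist.
missing_issue = try_insert("INSERT INTO Articles (ArticleID, IssueID) VALUES (2, 99)")
# A new article for the existing issue is fine: one issue, many articles.
valid_row = try_insert("INSERT INTO Articles (ArticleID, IssueID) VALUES (2, 1)")
print(duplicate_key, missing_issue, valid_row)
```

With linked spreadsheets, nothing stops either of the first two mistakes; here the database refuses them for you.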


If the Dataville Dispatch had a
RDBMS, it would be a lot easier
to come up with analyses like the
one you just did.

Dataville Dispatch built an RDBMS
with your relationship diagram

It was about time that the Dispatch loaded all 
those spreadsheets into a real RDBMS. With 
the diagram you brainstormed, along with 
the managing editor’s explanation of their 
data, a database architect pulled together this 
relational database. 


Sharpen your pencil


Here is the schema for the Dataville
Dispatch's database. Circle the tables
that you’d need to pull together into
a single table in order to show which
author has the articles with the most
web hits and web comments.

Then draw the table below that
would show the fields you’d need in
order to create those scatterplots.


Now that we’ve found the optimum 
article count, we should figure out who our 
most popular authors are so that we can 
make sure they're always in each issue. You 
could count the web hits and comments that 
each article gets for each author. 



Wholesalers: WholesalerID, data field, data field, data field

Issues: IssueID, data field, EditorID, data field, data field

Sales: SaleID, IssueID, WholesalerID, data field

Articles: ArticleID, IssueID, AuthorID, web hits, CommentID

Authors: AuthorID, AuthorName, data field, data field

WebComments: CommentID, ArticleID, data field, data field


Draw the table you need to have here.

This table... it's a list of the articles in each edition.
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find your tables 


Sharpen your pencil
Solution


What tables do you need to join together so that you can
evaluate each author’s popularity, by counting the web hits and
comments that author receives?


You need a table that draws these
tables from the database together.


[Figure: the Dataville Dispatch schema (Issues, Wholesalers, Sales, Articles, Authors, WebComments) with the needed tables marked, and a hypothetical table drawn below it.]


Each row represents an article.

...is the author of both articles 1 and 2 in this hypothetical table.


Dataville Dispatch extracted your
data using the SQL language

SQL, or Structured Query Language, is how data is
extracted from relational databases. You can get your
database to respond to your SQL questions either by
typing the code directly or using a graphical interface
that will create the SQL code for you.


Here's a simple SQL query.


Here's the output from the query that gets you the table you want.


www.headfirstlabs.com/books/hfda/hfda_ch12_articleHitsComments.csv



You don’t have to learn SQL, but it’s a good idea. 
What’s crucial is that you understand how to 

ask the right questions of the database by 
understanding the tables inside the database and 
the relations among them. 
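
To give a feel for asking a question across related tables, here is a hedged sketch using SQLite from Python. The table and column names follow the schema described for the Dispatch, but the rows, the author names, and both queries are invented for illustration; they are not the queries the Dispatch actually ran.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Authors (AuthorID INTEGER PRIMARY KEY, AuthorName TEXT);
CREATE TABLE Articles (ArticleID INTEGER PRIMARY KEY, AuthorID INTEGER, webHits INTEGER);
CREATE TABLE WebComments (CommentID INTEGER PRIMARY KEY, ArticleID INTEGER);
INSERT INTO Authors VALUES (1, 'Author A'), (2, 'Author B');
INSERT INTO Articles VALUES (1, 1, 2000), (2, 2, 1400), (3, 1, 1900);
INSERT INTO WebComments VALUES (1, 1), (2, 1), (3, 2), (4, 3), (5, 3), (6, 3);
""")

# A simple query: the name of the author whose AuthorID is 1.
simple = conn.execute(
    "SELECT AuthorName FROM Authors WHERE AuthorID = 1"
).fetchall()

# A more complex query: trace the path Articles -> Authors and
# Articles -> WebComments to get one row per article.
rows = conn.execute("""
SELECT a.ArticleID, au.AuthorName, a.webHits, COUNT(c.CommentID) AS commentCount
FROM Articles AS a
JOIN Authors AS au ON au.AuthorID = a.AuthorID
LEFT JOIN WebComments AS c ON c.ArticleID = a.ArticleID
GROUP BY a.ArticleID, au.AuthorName, a.webHits
ORDER BY a.ArticleID
""").fetchall()

for row in rows:
    print(row)
```

The second query is the database doing, in one step, the copy-paste-and-formula work you did by hand with the spreadsheets.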


This query returns the name of the author listed in the
Author table with the AuthorID field set to 1.


The query that created this data is more complex than the example.


Use the command below to load the hfda_ch12_articleHitsComments.csv
spreadsheet into R, and then take a look
at the data with the head command:


Make sure you're connected to the Internet for this command.


articleHitsComments <- read.csv("http://www.headfirstlabs.com/books/hfda/hfda_ch12_articleHitsComments.csv", header=TRUE)


We’re going to use a more powerful function to create scatterplots this time.
Using these commands, load the lattice package and then run the
xyplot formula to draw a “lattice” of scatterplots:


library(lattice)

xyplot(webHits ~ commentCount | authorName, data=articleHitsComments)

This is a new symbol.

This is the data frame that you loaded.

What author or authors perform the best, based on these metrics?
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lattice scatterplot visualization 



What do your scatterplots show? Do certain authors get greater sales? 


Load the hfda_ch12_articleHitsComments.csv spreadsheet into R.

We’re going to use a more powerful function to create scatterplots this time.
Using these commands, load the lattice package and then run the
xyplot formula to draw a “lattice” of scatterplots:


library(lattice)

xyplot(webHits ~ commentCount | authorName, data=articleHitsComments)


> head(articleHitsComments)
  articleID authorName webHits commentCount
[Console output: the first six rows, one article per row.]
> library(lattice)
> xyplot(webHits ~ commentCount | authorName, data=articleHitsComments)


This data matches the table you previsualized.

This loads the lattice package.

This draws scatterplots by author name.


[Figure: a lattice of scatterplots, one panel per author.]

This array of scatterplots shows web hits and
comments for each article, grouped by author.


What author or authors perform the best on these metrics?


The web stats are all over the map, with
authors performing very differently
from each other.


It’s pretty clear that Rafaela Cortez performs the best. All her articles have 2,000 or more web
hits, and most show more than 20 comments. People seem really to like her. As for the
rest of the authors, some (like Destiny and Nicole) tend to do better than the rest. Nike has a
pretty big spread in his performance, while Brewster and Jason tend not to be too popular.
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rdbms can get complex 


Here's what the editor has to say about
your most recent analysis.


From: Dataville Dispatch
Subject: About our data

Wow, that really surprised me. I’d always
suspected that Rafaela and Destiny
were our star writers, but this shows
that they’re way ahead of everyone. Big
promotion for them! All this information
will make us a much leaner publication
while enabling us to better reward our
authors’ performance. Thank you.

- DD


Comparison possibilities are endless
if your data is in a RDBMS

The complex visualization you just did with data from the
Dispatch’s RDBMS is just the tip of the iceberg. Corporate
databases can get big: really, really big. And what that means
for you as an analyst is that the range of comparisons relational
databases give you the ability to make is just enormous.


[Figure: a sea of interlinked RDBMS tables, each holding rows of data.]


Think about how far you can reach across this sea of tables to
make a brilliant comparison.


Databases can get big... really, really big.


The Dataville Dispatch's database structure isn't anywhere near this big, but
databases easily get this large.


If you can envision it, a RDBMS can tie data together for
powerful comparisons. Relational databases are a dream come
true for analysts.
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relational databases 


You're on the cover

The authors and editors of the Dataville
Dispatch were so impressed by your work
that they decided to feature you in their
big data issue! Nice work. Guess who
wrote the big story?


13 cleaning data


Impose order

Your data is useless... 

...if it has messy structure. And a lot of people who collect data do a crummy job of 
maintaining a neat structure. If your data’s not neat, you can’t slice it or dice it, run 
formulas on it, or even really see it. You might as well just ignore it completely, right? 
Actually, you can do better. With a clear vision of how you need it to look and a few text 
manipulation tools, you can take the funkiest, craziest mess of data and whip it into 
something useful. 

















salvage a client list


Just got a client list from
a defunct competitor

Your newest client, Head First Head Hunters, just
received a list of job seekers from a defunct
competitor. They had to spend big bucks to get it,
but it’s hugely valuable. The people on this list are
the best of the best, the most employable people
around.


This list could be a gold mine...


...too bad the data is a mess!
In its current form, there’s not
much they can do with this
data. That’s why they called
you. Can you help?


Look at all this stuff!

What are you going to do with this data?


www.headfirstlabs.com/books/hfda/hfda_ch13_raw_data.csv


[Figure: hfda_ch13_raw_data.csv open in a spreadsheet. Every record is crammed into column A under the header PersonID#FirstName#LastName#ZIP#Phone#CallID#Time.]
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cleaning data 


The dirty secret of data analysis 



The dirty secret of data analysis is that as an analyst
you might spend more time cleaning data than
analyzing it. Data often doesn’t arrive perfectly
organized, so you’ll have to do some heavy text
manipulation to get it into a useful format for
analysis.

The visionary at work

This is the fun part of data analysis.

But your work as a data analyst may actually involve a lot of this…


Sharpen your pencil


What will be your first step for dealing with this messy data? Take
a look at each of these possibilities and write the pros and cons
of each.


1. Start retyping it.


2. Ask the client what she wants to do with the data once it’s cleaned.


3. Write a formula to whip the data into shape.
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head hunter goals 


Sharpen your pencil
Solution


Which of these options did you choose as your first step?


1. Start retyping it.

This sucks. It'll take forever, and there’s a good chance I'll transcribe it incorrectly, messing up
the data. If this is the only way to fix the data, we’d better be sure before going this route.

2. Ask the client what she wants to do with the data once it’s cleaned.

This is the way to go. With an idea of what the client wants to do with the data, I can make
sure that whatever I do puts the data in exactly the form that they need.

3. Write a formula to whip the data into shape.

A powerful formula or two would definitely help out, once we have an idea of what the data
needs to look like from the client. But let’s talk to the client first.


Head First Head Hunters wants
the list for their sales team


We need a call list for our sales team to use to contact 
prospects we don’t know. The list is of job seekers 
who have been placed by our old competitor, and we 
want to be the ones who put them in their next job. 


Even though the raw data is a mess, it looks like they just 
want to extract names and phone numbers. Shouldn’t be 
too much of a problem. Let’s get started... 
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cleaning data 


Sharpen your pencil 




The data looks like a list of names, which is what we’d 
expect from the client’s description of it. What you need is 
a clean layout of those names. 


Draw a picture that shows some columns and sample data for 
what you want the messy data to look like. 



Looks like these are field names up top.


Draw your ideal data layout here.


[Figure: the raw data again, with everything in column A under the header PersonID#FirstName#LastName#ZIP#Phone#CallID#Time.]
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previsualize your columns 



Sharpen your pencil
Solution


How would you like your data to look once you’ve cleaned it up?

...all mashed together in Column A.

...to be split into columns.

...mail merge or web page or whatever else.


Gotta have the phone numbers... that’s the most important
thing for the sales team!


[Figure: the ideal layout, a table with the columns PersonID, FirstName, LastName, and Phone.]


This ID field is useful, since
it will let you make sure
that records are unique.


...the name and phone fields separated...
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cleaning data 



It’s true: thinking about what neat 
data looks like won’t actually make it 
neat. But we needed to previsualize a 
solution before getting down into the 
messy data. Let’s take a look at our 
general strategy for fixing messy 
data and then start coding it... 
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plan your attack 


Cleaning messy data is
all about preparation

It may go without saying, but cleaning
data should begin like any other data
project: making sure you have copies of
the original data so that you can go back
and check your work.


Once you’ve figured out what you need
your data to look like, you can then
proceed to identify patterns in the
messiness.

Already did this!


The last thing you want to do is go back
and change the data line by line (that
would take forever), so if you can identify
repetition in the messiness you can write
formulas and functions that exploit the
patterns to make the data neat.


Once you’re organized,
you can fix the data itself

Then, patterns in hand, you can get down to the work of
actually fixing the data. As you’ll discover, this process is
often iterative, meaning you will clean and restructure the
data over and over again until you get the results you need.


Sharpen your pencil


First, let’s split up the fields. Is there a pattern to how the fields
are separated from each other?


[Figure: the first rows of the raw data in column A.]


split the columns


Sharpen your pencil
Solution


What patterns did you find in the data?


All the data is in Column A with the fields mashed together. Between each field there
is a single character: the pound sign (#).
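
Once you have spotted a single-character delimiter, splitting a record is a one-line operation in most languages. The sketch below assumes the field names read off the header row and uses a made-up record, not a real row from the client list.

```python
# Field names assumed from the header row of the raw data.
FIELD_NAMES = ["PersonID", "FirstName", "LastName", "ZIP", "Phone", "CallID", "Time"]


def split_record(line, delimiter="#"):
    """Split one mashed-together record into named fields."""
    return dict(zip(FIELD_NAMES, line.split(delimiter)))


# A hypothetical record in the same shape as the raw data.
raw = "1#^Ann#Example#00000#555-0100#10#01/01/08 09:00"
record = split_record(raw)
print(record["FirstName"], record["Phone"])
```

Splitting only separates the fields; any junk characters inside a field, like the one at the front of the name here, are still there afterward.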
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n 


Use the # sign as a delimiter


Excel has a handy tool for splitting data into columns
when the fields are separated by a delimiter (the
technical term for a character that marks the break
between fields). Select Column A in your data and
press the Text to Columns button under the Data
tab…


Select Column A and click this button: Text to Columns.


...and now you’ve started the Wizard. In the first step,
tell Excel that your data is split up by a delimiter. In
the second step, tell Excel that your delimiter is the #
character. What happens when you click Finish?


Tell Excel you have a delimiter.

Specify the delimiter.


[Figure: the Convert Text to Columns Wizard, Step 1 of 3 and Step 2 of 3.]
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cleaning data 



'iswiifi 




4213 

■ft* 

i 255 


ll 如 91? 




Excel split your data into 
columns using the delimiter 

And it was no big deal. Using 
Excel’s Convert Text to 
Column Wizard is a great 
thing to do if you have 
simple delimiters separating 
your fields. 

The data is now neatly separated into columns. 

Now that the pieces of data are separated, you can manipulate them individually if you want to. 

But the data still has a few 
problems. The first and 
last names, for example, 
both appear to have junk 
characters inside the fields. 
You’ll have to come up with a 
way to get rid of them! 


What are you going to do?
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Sharpen your pencil 




What’s the pattern you’d use to fix the FirstName column? 
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kill the carat 

Sharpen your pencil 
Solution 


Is there a pattern to the messiness in the FirstName field? 


At the beginning of every name is a ^ character. We need to get rid of all of them in order 
to have neat first names. 


A 


FirstName 




This character is everywhere!


Let’s see what 
Excel has for us. 
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cleaning data 



Match each Excel text formula with its function. Which function do 
you think you’ll need to use to clean up the Name column? 


FIND 

Tells you the length of a cell. 

LEFT 

Returns a numerical value for a number stored as text. 

RIGHT 

Grabs characters on the right side of a cell. 

TRIM 

Replaces text you don’t want in a cell with 
new text that you specify. 

LEN 

Tells you where to find a search string 
within a cell. 

CONCATENATE 

Takes two values and sticks them together. 

VALUE 

Grabs characters on the left side of a cell. 

SUBSTITUTE 

Removes excess blank spaces from a cell. 
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text formula spectacular 
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cleaning data 


Use SUBSTITUTE to replace 
the caret character 



To fix the FirstName field, type this formula into cell H2: 
=SUBSTITUTE(B2,"^","") 




Copy this formula and paste it into each cell in Column H. What happens? 


Put the formula here. 


there are no 
Dumb Questions 


Am I limited to just these formulas? 
What if I want to take the characters 
on the left and right of a cell and stick 
them together? It doesn’t look like there’s a 
formula that does just that. 

There isn’t, but if you nest the text 
functions inside of each other you can 
achieve much more complicated text 
manipulations. For example, if you wanted 
to take the first and last characters inside of 
cell A1 and stick them together, you’d use 
this formula: 

CONCATENATE(LEFT(A1,1), 
RIGHT(A1,1)) 


So I can nest a whole bunch of text 
formulas inside of each other? 

You can, and it’s a powerful way to 
manipulate text. There’s a problem, though: 
if your data is really messy and you have 
to nest a whole bunch of formulas inside 
of each other, your entire formula can be 
almost impossible to read. 

Who cares? As long as it works, I’m 
not going to be reading it anyway. 

Well, the more complex your formula, 
the more likely you’ll need to do subtle 
tweaking of it. And the less readable your 
formula is, the harder that tweaking will be. 


Then how do I get around the 
problem of formulas that are big and 
unreadable? 

Instead of packing all your smaller 
formulas into one big formula, you can break 
apart the small formulas into different cells 
and have a “final” formula that puts them all 
together. That way, if something is a little off, 
it’ll be easier to find the formula that needs 
to be tweaked. 

You know, I bet R has much more 
powerful ways of doing text manipulation. 

It does, but why bother learning them? 
If Excel’s SUBSTITUTE formula handles 
your issue, you can save yourself some time 
by skipping R. 
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fixed first names 


You cleaned up all the first names 


Using the SUBSTITUTE formula, you had Excel 
grab the ^ symbol from each first name and replace 
it with nothing, which you specified by two quotation 
marks (""). 





Lots of different software lets you get rid of crummy 
characters by replacing those characters with nothing. 
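
R is one of those programs. As a minimal sketch (the sample names are made up for illustration), `gsub` with `fixed = TRUE` does the same job as SUBSTITUTE: it looks for the literal ^ character and swaps in an empty string.

```r
# Hypothetical first names with the junk caret in front.
first_names <- c("^Alexia", "^Stuart", "^Moe")

# fixed = TRUE makes "^" a plain character instead of a regex anchor.
clean_first <- gsub("^", "", first_names, fixed = TRUE)

print(clean_first)
```

Without `fixed = TRUE`, the caret would be read as the regular expression symbol for "start of the string" and nothing would be removed.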
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To make the original first name data go away 
forever, copy the H column and then Paste 
Special > Values to turn these values into actual 
text rather than formula outputs. After that, you 
can delete the FirstName column so that you 
never have to see those pesky ^ symbols again. 



You can delete away... as long as you 
saved a copy of the original 
file so you can refer back to 
it if you made a mistake. 
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cleaning data 



Hmpf. That first name pattern was easy 
because it was just a single character 
at the beginning that had to be removed. 
But the last name is going to be harder, 
because it’s a tougher pattern. 



Let’s try using SUBSTITUTE again, this time to fix the last names. 


[Screenshot: the LastName column (column C), where each name has an ID string embedded in it, such as Rasmuss(ID 98)en and Co(ID 156)ok]


First, look for the pattern in this messiness. What would you tell 
SUBSTITUTE to replace? Here’s the syntax again: 


=SUBSTITUTE(your reference cell, 
the text you want to replace, 
what you want to replace it with) 

Can you write a formula that works? 
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tough pattern 

Sharpen your pencil 
Solution 





Could you fix the LastName field using SUBSTITUTE? 


SUBSTITUTE won’t work here! Every cell has different 
messy text. In order to make SUBSTITUTE work, you’d 
have to write a separate formula for each last name. 


SUBSTITUTE(C2,"(ID 127)","") 

SUBSTITUTE(C3,"(ID 98)","") 

SUBSTITUTE(C4,"(ID 94)","") 


And typing a bajillion formulas like this defeats the purpose 
of using formulas to begin with. Formulas are supposed to 
save you the trouble of typing and retyping! 


The last name pattern is too 
complex for SUBSTITUTE 

The SUBSTITUTE function looks for a pattern 
in the form of a single text string to replace. The 
problem with the last names is that each has 
a different text string to replace. 



These strings are different.


Rasmuss(ID 98)en 
Co(ID 156)ok 



You can’t just type in the value 
you want replaced, because that 
value changes from cell to cell. 


And that’s not all: the pattern of messiness in the 
LastName field is more complex in that the messy 
strings show up in different positions within 

each cell and they have different lengths. 

The messiness starts on the eighth character of the cell... 

Rasmuss(ID 98)en 

...and here it starts on the third character. 

Co(ID 156)ok 

The length of this text is seven characters. 

This one is eight characters long. 




Handle complex patterns 
with nested text formulas 

Once you get familiar with Excel text formulas, you 
can nest them inside of each other to do complex 
operations on your messy data. Here’s an example: 


The FIND formula returns a number that represents the position of the "(". 

FIND("(",C3) 

LEFT grabs the leftmost text. 

LEFT(C3,FIND("(",C3)-1) 

Rasmuss(ID 98)en 

RIGHT(C3,LEN(C3)-FIND(")",C3)) 

CONCATENATE(LEFT(C3,FIND("(",C3)-1), 
RIGHT(C3,LEN(C3)-FIND(")",C3))) 


The formula works, but there’s a problem: it’s starting to 
get really hard to read. That’s not a problem if you write 
formulas perfectly the first time, but you’d be better off 
with a tool that has power and simplicity, unlike this nested 
CONCATENATE formula. 
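
For comparison, here is a sketch of the same find-left-right-join logic written step by step in R, using the Rasmuss(ID 98)en example. The variable names are made up for illustration. Giving each step its own name is one way to keep the logic readable, much like breaking a big formula into helper cells.

```r
last_name <- "Rasmuss(ID 98)en"

# Like FIND: position of the opening and closing parentheses.
open_pos  <- as.integer(regexpr("(", last_name, fixed = TRUE))
close_pos <- as.integer(regexpr(")", last_name, fixed = TRUE))

# Like LEFT: everything before the "(".
left_part <- substr(last_name, 1, open_pos - 1)

# Like RIGHT: everything after the ")".
right_part <- substr(last_name, close_pos + 1, nchar(last_name))

# Like CONCATENATE: stick the two pieces together.
cleaned <- paste0(left_part, right_part)
print(cleaned)
```

It works, but it is still several steps for one cleanup, which is the problem the next tool addresses.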


Rasmuss(ID 98)en 


Rasmussen 


RIGHT grabs the rightmost text. 


CONCATENATE puts it all together. 


Wouldn’t it be dreamy if there were an 
easier way to fix complex messes than with 
long, unreadable formulas like that one? But 
I know it’s just a fantasy... 
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regex in r 


R can use regular expressions to 
crunch complex data patterns 

Regular expressions are a programming tool 
that allows you to specify complex patterns to 
match and replace strings of text, and R has a 
powerful facility for using them. 

Here’s a simple regular expression pattern that 
looks for the letter “a”. When you give this pattern 
to R, it’ll say whether there’s a match. 
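
As a minimal sketch of that idea, R's `grepl` function takes a pattern and some text and answers TRUE or FALSE for each item. The two sample names are made up for illustration.

```r
# Does each name contain a lowercase "a"?
sample_names <- c("Alexia", "Bob")
has_a <- grepl("a", sample_names)

print(has_a)
```

Matching is case sensitive by default, so the capital "A" at the start of "Alexia" does not count; the match there comes from the final "a".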


Geek Bits 



Regular expressions are the ultimate 
tool for cleaning up messy data. 

Lots of platforms and languages 
implement regular expressions, even 
though Excel doesn’t. 
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cleaning data 



[Screenshot: R console showing the data loaded with read.csv into a table called hfhh and the first rows displayed with head]

Better get moving! Here goes: 


o 



head command. You can either save your Excel file as a CSV and load 
the CSV file into R, or you can use the web link below to get the most 
recent data. 


this! 


This command reads 
the CSV into a 
table called hfhh. 


❺ 


Run this regular expressions command 

NewLastName <- sub("\\(.*\\)","",hfhh$LastName) 


o Then take a look at your work by running the head command to see 
the first few rows of your table. 

head(NewLastName) 


What happens? 


From: Head First Head Hunters 
To: Analyst 

Subject: Need those names NOW 

Well get on with it! Those prospects are 
hot and are only getting colder. I want 
our sales force to start calling people 
like yesterday! 


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substitute with regex 


The sub command fixed your last names 


The sub command used the pattern you 
specified and replaced all instances of it with 
blank text, effectively deleting every parenthetical 
text string in the LastName column. 


Let’s take a closer look 
at that syntax: 

NewLastName <- sub("\\(.*\\)","",hfhh$LastName) 

NewLastName is a new vector for your cleaned last names. 

Here’s your regular expression pattern. 

This is blank text, which replaces text that matches the pattern with nothing. 



If you can find a pattern in the messiness of your data, 
you can write a regular expression that will neatly exploit 
it to get you the structure you want. 

No need to write some insanely long spreadsheet formula! 
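
Here is a small sketch you can run to see the effect without loading the full data set. The vector below just reuses the two example names from this chapter; the variable names are made up for illustration.

```r
# Two messy last names, each with a different parenthetical string inside.
messy_last <- c("Rasmuss(ID 98)en", "Co(ID 156)ok")

# One pattern handles both: "(" then anything then ")".
clean_last <- sub("\\(.*\\)", "", messy_last)

print(clean_last)
```

One pattern cleans every element, even though the junk text differs in content, position, and length from name to name.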


Regular Expression Up Close 

Your regular expression has three parts: the left parenthesis, 
the right parenthesis, and everything in between. 



The left parenthesis (the backslashes are 
"escape" characters that tell R that 
the parenthesis is not itself an 
R expression). 

Everything in between. 

The right parenthesis. 

The dot means "any character." 

The asterisk means "any number 
of the preceding character." 
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cleaning data 


If regular expressions are so 
common in programming languages, why 
can’t Excel do regular expressions? 

On the Windows platform, you can 
use Microsoft’s Visual Basic for Applications 
(VBA) programming language inside of Excel 
to run regular expressions. But most people 
would sooner just use a more powerful 
program like R than take the trouble to learn 
to program Excel. Oh, and since VBA was 
dropped from the recent release of Excel for 
Mac, you can’t use regex in Excel for Mac, 
regardless of how badly you might want to. 



this! 



Add the new LastName 
vector to hfhh. 


This file will be found in your 
R working directory, which 
R will tell you about with 
the getwd() command. 


Some of those regular expression 
commands look really hard to read. How 
hard is it to master regular expressions? 

They can be hard to read because 
they’re really concise. That economy of 
syntax can be a real benefit when you have 
crazy-complex patterns to decode. Regular 
expressions are easy to get the hang of but 
(like anything complex) hard to master. Just 
take your time when you read them, and 
you'll get the hang of them. 
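
One way to take a pattern slowly is to try it on a small test string and look at what it removes. The sketch below uses a made-up string with two parenthetical chunks, which is a case the chapter's data does not contain, to show two details worth knowing: `.*` is greedy (it grabs as much as it can), and `sub` replaces only the first match while `gsub` replaces every match. The `perl = TRUE` argument and the `.*?` "match as little as possible" form are standard R options, used here as an illustration rather than something this chapter relies on.

```r
# Hypothetical test string with two parenthetical chunks.
test_string <- "Co(ID 156)ok (cell)"

# Greedy: from the first "(" all the way to the LAST ")".
greedy <- sub("\\(.*\\)", "", test_string)

# Lazy: from the first "(" to the nearest ")". Only the first match is replaced.
lazy_first <- sub("\\(.*?\\)", "", test_string, perl = TRUE)

# gsub applies the lazy pattern to every match in the string.
lazy_all <- gsub("\\(.*?\\)", "", test_string, perl = TRUE)

print(c(greedy, lazy_first, lazy_all))
```

For data like the LastName column, where each cell has exactly one parenthetical string, the greedy pattern and `sub` are all you need.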

What if data doesn’t even come in a 
spreadsheet? I might have to extract data 
from a PDF, a web page, or even XML 




u 


Now you can ship the 
data to your client 


Better write your new 
work to a CSV file for 
your client. 


Remove old LastName vector 
from the hfhh data frame. 


Write results 
to a CSV file. 



Regardless of whether your client 
is using Excel, OpenOffice, or any 
statistical software package, he’ll 
be able to read CSV files. 
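
Here is a hedged sketch of that swap-and-save routine on a tiny stand-in table, so you can try it without touching your real hfhh data. The data frame, its values, and the file name are made up for illustration, and the file goes to R's temporary directory rather than your working directory. The `row.names = FALSE` argument is a choice made here to keep R's row numbers out of the file.

```r
# Tiny stand-in for the real table (hypothetical values).
hfhh_demo <- data.frame(PersonID = c(127, 156),
                        LastName = c("Rasmuss(ID 127)en", "Co(ID 156)ok"),
                        stringsAsFactors = FALSE)

# Clean the names, drop the old column, add the cleaned one.
new_last <- sub("\\(.*\\)", "", hfhh_demo$LastName)
hfhh_demo$LastName <- NULL
hfhh_demo["LastName"] <- new_last

# Write to a temporary CSV, then read it back to check the result.
out_file <- file.path(tempdir(), "hfhh_demo.csv")
write.csv(hfhh_demo, file = out_file, row.names = FALSE)
round_trip <- read.csv(out_file, stringsAsFactors = FALSE)
print(round_trip)
```

Reading the file back in is a cheap way to confirm that what you are about to send looks the way you expect.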








there are no 
Dumb Questions 


Those are the sorts of situations where 
regular expressions really shine. As long 
as you can get your information into some 
sort of text file, you can parse it with regular 
expressions. Web pages in particular are a 
really common source of information for data 
analysis, and it’s a snap to program HTML 
tag patterns into your regex statements. 

What other specific platforms use 
regular expressions? 

Java, Perl, Python, JavaScript... all 
sorts of different programming languages 
use them. 



hfhh$LastName <- NULL 
hfhh["LastName"] <- NewLastName 
write.csv(hfhh, file="hfhh.csv") 
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duplicate data 



He’s got a point. Take “Alexia 
Rasmussen,” for example. Alexia 
definitely shows up more than once. It 
could be that there are two separate 
people named Alexia Rasmussen, of 
course. But then again, both records 
here have Person ID equal to 127, 
which would suggest that they are the 
same person. 

Maybe Alexia is the only duplicate 
name and the client is just reacting 
to that one mistake. To find out, you’ll 
need to figure out how you can see 
duplicates more easily than just looking 
at this huge list as it is. 



Maybe you’re not 
quite done yet... 

The client has a bit of a 
problem with your work. 






Sort your data to show 
duplicate values together 


If you have a lot of data, it can be hard to 
see whether values repeat. It’s a lot easier 
to see that repetition if the lists are sorted. 


It’s hard to see repetition in lists like this, 
especially if it’s a long list. 



Unsorted 



Lots of repetitions here. 




Let’s get a better look at the duplicates in your list by sorting the data. 


In R, you sort a data frame by using the order function 
inside of the subset brackets. Run this command: 


A sorted copy of your list 


hfhhSorted <- hfhh[order(hfhh$PersonID),] 



Because the PersonID field probably represents a unique number for 
each person, it’s a good field to sort by. After all, it’s possible 
that there’s more than one person in the data named “John Smith.” 

Next, run the head command to see what you’ve created: 


head(hfhhSorted, n=50) 
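
To see what `order` is doing inside those brackets, here is a minimal sketch on a made-up four-row table (not the real hfhh data). `order` returns the row positions in sorted sequence, and the brackets use those positions to rearrange the rows.

```r
# Hypothetical call records, deliberately out of order.
calls <- data.frame(PersonID  = c(156, 127, 156, 127),
                    FirstName = c("Stuart", "Alexia", "Stuart", "Alexia"),
                    stringsAsFactors = FALSE)

# order() gives the row positions that would sort PersonID.
row_positions <- order(calls$PersonID)

# Rows go before the comma; leaving the column slot empty keeps all columns.
calls_sorted <- calls[row_positions, ]
print(calls_sorted)
```

After sorting, the rows for the same PersonID sit next to each other, which is what makes the repeats easy to spot.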


What does R do? 
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sorting reveals duplicates 



Exercise 

Solution 


Did sorting your data frame in R by PersonID reveal any duplicates? 


[Screenshot: the first 50 rows of hfhhSorted in the R console, sorted by PersonID]

Yes, there are lots and 
lots of duplicate names. 
Lots. What a mess! 




When you get messy data, you should sort 
liberally. Especially if you have a lot of records. 
That’s because it’s often hard to see all the data at 
once, and sorting the data by different fields lets you 
visualize groupings in a way that will help you find 
duplicates or other weirdness. 
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如 1 e£ lC -：^ 




Sharpen your pencil 


Take a close look at the data. Can you say 
why the names might be duplicated? 


Write your answer here. 
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rdbms returns 


Sharpen your pencil 
Solution 


Why do you think the same names show up repeatedly? 



If you look at the far right column, you can 
see that there is a data point unique to each 
record: a time stamp of a phone call. That 
probably means that each of the lines in this 
database represents a phone call, so the names 
repeat because there are multiple calls to the 
same people. 


The data is probably from 
a relational database 

If elements of your messy list repeat, 
then the data probably come from a 
relational database. In this case, your 
data is the output of a query that 
consolidated two tables. 

Because you understand RDBMS 
architecture, you know that repetition 
like what we see here stems from 
how queries return data rather 
than from poor data quality. So 
you can now remove duplicate names 
without worrying that something’s 
fundamentally wrong with your data. 



The original database 
for this data might 
have looked like this. 




Person 


PersonID 

FirstName 

LastName 

Etc... 


PhoneCall 


PhoneCallID 

PersonID 

CallDate 

Etc... 
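
To see why a query over tables like these repeats names, here is a sketch in R that joins a tiny Person table to a tiny PhoneCall table. All of the values are made up for illustration, and R's `merge` function stands in for whatever query the client's database actually ran.

```r
# One row per person (hypothetical values).
person <- data.frame(PersonID  = c(127, 156),
                     FirstName = c("Alexia", "Stuart"),
                     LastName  = c("Rasmussen", "Cook"),
                     stringsAsFactors = FALSE)

# One row per phone call; Alexia was called twice.
phone_call <- data.frame(PhoneCallID = c(1, 2, 3),
                         PersonID    = c(127, 127, 156),
                         CallDate    = c("2009-01-05", "2009-02-10", "2009-01-07"),
                         stringsAsFactors = FALSE)

# Join the two tables on PersonID: one output row per call.
joined <- merge(person, phone_call, by = "PersonID")
print(joined)
```

The joined table has three rows, and Alexia's name appears twice because she has two calls. Nothing is wrong with the data; that is simply what a join returns.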




Who knows what 
other stuff was 
in the database? 
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cleaning data 


Remove duplicate names 

Now that you know why there are duplicate names, 
you can start removing them. Both R and Excel 
have quick and straightforward functions for 
removing duplicates. 
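
As a minimal sketch of the R route, here is the idea on a made-up three-row table (the column names follow this chapter, but the values are invented). Rows only count as duplicates when every column matches, so the columns that differ from call to call have to go first.

```r
# Hypothetical call-level records: Alexia appears twice.
demo <- data.frame(PersonID  = c(127, 127, 156),
                   FirstName = c("Alexia", "Alexia", "Stuart"),
                   CallID    = c(4213, 4255, 4237),
                   stringsAsFactors = FALSE)

# With CallID still in place, every row is different.
rows_before <- nrow(unique(demo))

# Drop the call-specific column, then keep one copy of each remaining row.
demo$CallID <- NULL
demo_unique <- unique(demo)

print(demo_unique)
```

Before dropping CallID, `unique` finds nothing to remove; afterward, the two Alexia rows collapse into one.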



So now that you have the tool you need to get rid 
of those pesky duplicate names, let’s clean up your 
list and give it back to the client. 




Create a new data frame to represent your unique records: 


hfhhNamesOnly <- hfhhSorted 





Remove the Call ID and Time fields, which the client doesn’t need 
and which are the cause of your duplicate names: 


o 

o 


hfhhNamesOnly$CallID <- NULL 
hfhhNamesOnly$Time <- NULL 

Use the unique function to remove duplicate names: 
hfhhNamesOnly <- unique(hfhhNamesOnly) 


Here’s unique in action! 


Take a look at your results and write them to a new CSV: 


head(hfhhNamesOnly, n=50) 

write.csv(hfhhNamesOnly, file="hfhhNamesOnly.csv") 
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keep it tight 



[Screenshot: R console showing the first 50 unique records in hfhhNamesOnly, followed by the write.csv command]


You created nice, clean, unique records 


This data looks totally solid. 

No columns mashed together, no funny 
characters, no duplicates. All from 
following the basic steps of cleaning a 
messy data set: 



Use your finished 
data. 
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cleaning data 


Head First Head Hunters is 
recruiting like gangbusters! 


Your list has proven to be incredibly 
powerful. With a clean data set of live 
prospects, HFHH is picking up more 
clients than ever, and they’d never have 
been able to do it without your data 
cleaning skills. Nice work! 
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go forth and analyze 


Leaving town... 



It’s been great having you here in Dataville! 


We’re sad to see you leave, but there’s nothing like taking what you’ve learned 
and putting it to use. You’re just beginning your data analysis journey, and we’ve put you in 
the driver’s seat. We’re dying to hear how things go, so drop us a line at the Head First Labs 
website, www.headfirstlabs.com, and let us know how data analysis is paying off for YOU! 
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, The Top Ten Things ^ 

(we didn’t cover) 



You’ve come a long way. 


But data analysis is a vast and constantly evolving field, and there’s so much left to learn. 
In this appendix, we’ll go over ten items that there wasn’t enough room to cover in this 
book but should be high on your list of topics to learn about next. 


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you need more stats 


# 1: Everything else in statistics 


Statistics is a field that has a huge array of 
tools and technologies for data analysis. It’s so 
important for data analysis, in fact, that many books 
about “data analysis” are really statistics books. 

Here is an incomplete list of the tools of statistics 
not covered in Head First Data Analysis. 



It’s a great idea to learn about all of 
these topics if you’re a data analyst. 

Not covered in Head First Data Analysis: 

Surveys 
Confidence intervals 
Standard error 
Sample averages 
Tests of significance 
Sampling 
The multiplication rule 
Independence 
Probability 
The binomial formula 
The null and the alternative 
t-Test 
Chi Squared Test 
z-Test 
The law of averages 
Probability histograms 
The normal approximation 
Box models 
Lots and lots of other stuff... 


Much of what you have learned in this book, however, 
has raised your awareness of deep issues involving 
assumptions and model-building, preparing you 
not only to use the tools of statistics but also to 
understand their limitations. 

The better you know statistics, the more likely you 
are to do great analytical work. 
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# Z: Excel skills 


This book has assumed that you have basic spreadsheet 
skills, but skilled data analysts tend to be spreadsheet 
ninjas. 

Compared to programs like R and subjects like 
regression, it’s not terribly hard to master Excel. And 
you should! 



The best data analysts can do 
spreadsheets in their sleep. 
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curator of cognitive art 


Edward Tufte and his 
principles of visualization 

Good data analysts spend a lot of time reading and 
rereading the work of great data analysts, and Edward 
Tufte is unique not only in the quality of his own work but 
in the quality of the work of other analysts that he collects 
and displays in his books. Here are his fundamental 
principles of analytical design: 

“Show comparisons, contrasts, differences.” 

“Show causality, mechanism, explanations, systematic structure.” 

“Show multivariate data; that is, show more than 1 or 2 
variables.” 

“Completely integrate words, numbers, images, diagrams.” 
“Thoroughly describe the evidence.” 

“Analytical presentations ultimately stand or fall depending on 
the quality, relevance, and integrity of their content.” 

— Edward Tufte 


These words of wisdom, along with much else, are from pages 
127, 128, 130, 131, 133, and 136 of his book Beautiful Evidence. 
His books are a gallery of the very best in the visualization of 
data. 

What’s more, his book Data Analysis for Public Policy 
is about as good a book on regression as you’ll 
ever find, and you can download it for free at this 
website: http://www.edwardtufte.com/tufte/dapp/. 
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# 4: PivotTables 


Pivot tables are one of the more 
powerful data analysis tools built into 
spreadsheets and statistical software. 
They’re fantastic for exploratory data 
analysis and for summarizing data 
extracted from relational databases. 
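
If you want a feel for what a pivot table does, here is a minimal sketch in R using the built-in `table` function as a stand-in. The call records are made up for illustration; the point is only that one line of code turns a long list of records into a grid of counts.

```r
# Hypothetical call-level records, like the output of a database query.
calls <- data.frame(FirstName = c("Alexia", "Alexia", "Stuart", "Alexia", "Stuart"),
                    Month     = c("Jan", "Feb", "Jan", "Jan", "Feb"),
                    stringsAsFactors = FALSE)

# Rows are people, columns are months, cells count the calls.
pivot <- table(calls$FirstName, calls$Month)
print(pivot)
```

Each cell answers a question like "how many calls did Alexia get in January?" without any manual counting.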
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r to the max 


The R community 


R isn’t just a great software program, it’s 
a great software platform. Much of its power 
comes from a global community of users and 
contributors who contribute free packages with 
functions you can use for your analytic domain. 

You got a taste of this community when you ran 
the xyplot function from lattice, a 
legendary package for data visualization. 


Your installation of R can have 
any combination of packages. 






























leftovers 


# 6: Nonlinear and multiple regression 


Even if your data do not exhibit a linear 
pattern, under some circumstances, you 
can make predictions using regression. One 
approach would be to apply a numerical 
transformation on the data that 
effectively makes it linear, and another way 
would be to draw a polynomial rather 
than linear regression line through the dots. 

Also, you don’t have to limit yourself to 
predicting a dependent variable from a 
single independent variable. Sometimes 
there are multiple factors that affect 
the variable, so in order to make a good 
prediction, you can use the technique of 
multiple regression. 
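
Both ideas can be sketched with R's `lm` function. The numbers below are synthetic: they were built from known formulas (stated in the comments) with no random noise, purely so you can check that the fitted coefficients come back as expected. Real data would never fit this neatly.

```r
# Multiple regression: y depends on two independent variables.
x1 <- c(1, 2, 3, 4, 5, 6)
x2 <- c(2, 1, 4, 3, 6, 5)
y  <- 1 + 2 * x1 + 3 * x2          # synthetic: a = 1, b = 2, c = 3
multi_fit <- lm(y ~ x1 + x2)
multi_coefs <- coef(multi_fit)

# Polynomial regression: a curved pattern fit with x and x squared.
x <- c(1, 2, 3, 4, 5, 6)
y_curve <- 2 + x^2                  # synthetic: intercept 2, no x term, x^2 term 1
poly_fit <- lm(y_curve ~ x + I(x^2))
poly_coefs <- coef(poly_fit)

print(round(multi_coefs, 3))
print(round(poly_coefs, 3))
```

The `I(x^2)` wrapper tells R to treat the squaring as arithmetic rather than as formula syntax.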




y = a + bx 

You use this equation to predict a dependent variable from a single independent variable. 

But you can also write an equation that predicts a dependent variable from several independent variables. 

y = a + bx_1 + cx_2 + dx_3 + ... 

This is the equation for multiple regression. 
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randomness and more hypotheses 


# 7: Null-alternative hypothesis testing 


While the hypothesis testing technique 
you learned in Chapter 5 is very general 
and can accommodate a variety of 
analytical problems, null-alternative 
testing is the statistical technique many 
(especially in academia and science) have 
in mind when they hear the expression 
“hypothesis testing.” 

This tool is used more often than it’s 
understood, and Head First Statistics is a 
great place to start if you’d like to learn it. 



# 8: Randomness 


Randomness is a big issue for data 
analysis. 

That’s because randomness is hard 
to see. When people are trying to 
explain events, they do a great job at 
fitting models to evidence. But they do 
a terrible job at deciding against using 
explanatory models at all. 


If your client asks you why a specific 
event happened, the honest answer 
based on the best analysis will often be, 
“the event can be explained by random 
variations in outcomes.” 



I never know what this guy 
has in store for me. He breaks 
every model I try to fit to his 
behavior. I wish I spoke English... 




# 9: Google Docs 


We’ve talked about Excel, OpenOffice, and R, but Google 
Docs definitely deserves an honorable mention. Not 
only does Google Docs offer a fully functioning online 
spreadsheet, it has a Gadget feature that offers a large 
array of visualizations. 


It’s fun to explore the different charts 
that you can do with Google Docs. 

[Screenshot: the Add a Gadget dialog in Google Docs, showing gadget categories and a Scatter Chart gadget]

You can make a lot of different visualizations 
using the gadget feature in Google Docs. 


What’s more, Google Docs has a variety of functions that 
offer access to real-time online data sources. It is free 
software that’s definitely worth checking out. 
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you 


# 10: Your expertise 

You’ve learned many tools in this book, but
what’s more exciting than any of them is
that you will combine your expertise in
your domain of knowledge with those
tools to understand and improve the world.

Good luck.

[Figure annotation: your expertise and new analytical tools]
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appendix ii: install r

Start R up!



Behind all that data-crunching power is enormous complexity. 

But fortunately, getting R installed and started is something you can accomplish in just a 
few minutes, and this appendix is about to show you how to pull off your R install without 
a hitch. 
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get excited about your new software 


Get started with R


Installing the powerful, free, open source 
statistical software R can be done in 
these four quick and easy steps. 

1. Head on over to www.r-project.org
to download R. You should have no
problem finding a mirror near you
that serves R for Windows, Mac, and
Linux.


[Screenshot: the R Project for Statistical Computing home
page. Annotation: Click this download link.]


2. Once you’ve downloaded the program file for R,
double-click on it to start the R installer.


[Screenshot: the R installer window and the downloaded R
program file. Annotations: Here’s the R installer window.
This is the R program file.]
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install r 




3. Accept all the default options for loading R by clicking Next through
these windows, and let the installer do its work.


[Screenshot: the R installer option windows. Annotations:
Just accept all the default configurations for R by
clicking Next. Waiting is the hardest part.]


4. Click the R icon on the desktop or Start
Menu, and you’re ready to start using R.


[Screenshot annotation: Here’s what the R window looks
like when you start it for the first time.]
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The ToolPak



Some of the best features of Excel aren’t installed by default. 

That’s right, in order to run the optimization from Chapter 3 and the histograms from 
Chapter 9, you need to activate the Solver and the Analysis ToolPak, two extensions 
that are included in Excel by default but not activated without your initiative. 
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installing the toolpak 


Install the data analysis tools in Excel 


Installing the Analysis ToolPak and Solver 
in Excel is no problem if you follow these 
simple steps. 


[Screenshot: the Microsoft Office Button in Excel.]


1. Click the Microsoft Office Button and select
Excel Options.


2. Select the Add-Ins tab and click Go next to
“Manage Excel Add-Ins.”



[Screenshot annotations: The Add-Ins tab. Click this button.]
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install excel analysis tools 


3. Make sure that the Analysis ToolPak and the Solver
Add-in boxes are checked, and then press OK.


[Screenshot annotation: Make sure that these two
boxes are checked.]


4. Take a look at the Data tab to make sure that the
Data Analysis and Solver buttons are there for you
to use.

[Screenshot annotation: Make sure these buttons can be
seen under the Data tab.]


That’s it!

Now you’re ready to start running 
optimizations, histograms, and much 
more. 
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Numbers 

3D scatterplots, 291 

Symbols 

~ not (probability), 176
<- assign (R), 413
\ escape character, 406
| given (probability), 176
| output (R), 380

* regular expressions wildcard, 406
. regular expressions wildcard, 406
? topic information (R), 404

A 

accuracy analysis, 172—174, 185-188, 214, 248, 300, 350 
Adobe Illustrator, 129 
algorithm, 284 

alternative causal models, 131 
analysis 

accuracy, 172—174, 185-188, 214, 248, 300, 350 
definitions of, 4, 7, 286 
exploratory data, 7, 124, 421 
process steps, 4, 35 
step 1: define, 5—8 
step 2: disassemble, 9—12, 256-258 
step 3: evaluate, 13—14 
step 4: decide, 15—17 
purpose of, 4 

Analysis ToolPak (Excel), 431—433 


“anti-resume,” 25

arrays (lattices) of scatterplots, 126, 291, 379—381 

association 

vs. causation, 291 
linear, 291-302 

assumptions 

based on changing reality, 109 
baseline set of, 11, 14
cataloguing, 99 

evaluating and calibrating, 98-100 
and extrapolation, 321-324 
impact of incorrect, 20-21, 34, 100, 323
inserting your own, 14 

making them explicit, 14, 16, 27, 99, 321-324 
predictions using, 322—323 
reasonableness of, 323-324 
reassessing, 24 

regarding variable independence, 103 
asterisk (*), 406 
averages, types of, 297 
=AVG() in Excel/OpenOffice, 121

B 

backslash (\), 406 

baseline expectations, 254 
(see also assumptions) 
baseline (null) hypothesis, 155 
base rate fallacy, 178 

base rates (prior probabilities), 178-189 
Bayes’ rule and, 182—189, 218 
defined, 178 

how new information affects, 185—188 
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Bayes’ rule 

effect of base rate on, 182-189, 218 
overview, 169, 182—183 
revising probabilities using, 217—223 
theory behind, 179—181 

Beautiful Evidence (Tufte), 420 

Behind the Scenes 

R.M.S. error formula, 338 
R regression object, 306 

bell curve, 270 
blind spots, 25-27 

Bullet Points 

client qualities, 6 

questions you should always ask, 286 
things you might need to predict, 286 

C

candidate hypothesis, 155 
cataloguing assumptions, 99 
causation 

alternative models, 131 

vs. association, 291 

causal diagrams, 46, 48 

causal networks, 148, 149 

flipping cause and effect, 45 

and scatterplots, 291-292 

searching for causal relationships, 124, 130 

chance error (residuals) 
defined, 330 

and managing client expectations, 332—335 

and regression, 335 

residual distribution, 336—337 

(see also Root Mean Squared [R.M.S.] error) 

Chance Error Exposed Interview, 335 
charting tools, comparing, 129, 211 
cleaning data (see raw data) 


clients 

assumptions of, 11, 20, 26, 196-198
communication with, 207 
as data, 11, 77—79, 144, 196 
delivering bad news, 60-61, 97
examples of, 8, 16, 26 

explaining limits of prediction, 322, 326, 332—335, 
356-357 

explaining your work, 33—34, 94—96, 202—204, 248 
helping them analyze their business, 7, 33-34, 108,
130, 240, 382

helping you define problem, 6, 38, 132, 135, 232, 
362 

visualizations, 115, 206, 222, 371 
listening to, 132, 135, 313, 316, 388
mental models of, 20, 26 
professional relationship with, 14, 40, 327 
understanding/analyzing your, 6, 119, 283 

cloud function, 114, 291 

code examples (see Ready Bake Code) 

coefficient 

correlation (r), 300-303, 338 
defined, 304 

“cognitive art,” 129, 420 

comparable, defined, 67 

comparisons 

break down summary data using, 10 

evaluate using, 13, 73 

of histograms, 287-288 

and hypothesis testing, 155, 158-162 

and linked tables, 366 

making the right, 120 

method of, 42, 58 

multivariate, 125-129, 291 

and need for controls, 58-59 

and observational data, 43, 47 

of old and new, 221-222 

with RDBMS, 382 

valid, 64, 67—68 

visualizing your, 72, 120-123, 126, 288-293
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=CONCATENATE() in Excel/OpenOffice, 398, 399,
403

conditional probabilities, 176—177, 182 
confounders 

controlling for, 50, 63—65, 67 
defined, 47 

and observational studies, 45, 49 
constraints 

charting multiple, 84—87 
defined, 79 

and feasible region, 85, 87 

as part of objective function, 80—82, 100 

product mixes and, 83 

quantitative, 100 

in Solver utility, 92-94, 104-106

contemporaneous controls, 59 
control groups, 58-59, 62-67 
controls 

contemporaneous, 59 

historical, 59, 66 

possible and impossible, 78-79 

Convert Text to Column Wizard (Excel), 394—395 

cor() command in R, 301-302

correlation coefficient (r), 300-303, 338

=COUNTIF() in Excel/OpenOffice, 368

CSV files, 371, 405, 407

curve, shape of, 266—270 

custom-made implementation, 365 

D 

data 

constantly changing, 311 
diagnostic/nondiagnostic, 159-162 
distribution of, 262 

dividing into smaller chunks, 9—10, 50, 256, 271-275, 
346-348 

duplicate, in spreadsheet, 408-413 


heterogeneous, 155 
importance of comparison of, 42 
messy, 410 

observations about, 13 
paired, 146, 291 
quality/replicability of, 303 
readability of, 386, 399 
scant, 142, 231—232 

segmentation (splitting) of, 346-348, 352, 354

subsets, 271—276, 288 

summary, 9-10, 256, 259-262 

“too much,” 117-119

when to stop collecting, 34, 118-120, 286 

(see also raw data; visualizations) 

data analysis (see analysis) 

Data Analysis for Public Policy (Tufte), 420 

data analyst performance 
empower yourself, 15 
insert yourself, 14 
not about making data pretty, 119 
professional relationship with clients, 14, 40, 327 
showing integrity, 131, 327

data art, 129 

databases, 365 
defined, 365 

relational databases, 359, 364—370 
software for, 365 

data cleaning (see raw data) 
data visualizations (see visualizations) 
decide (step 4 of analysis process), 15-17 
decision variables, 79-80, 92, 233 
define (step 1 of analysis process), 5-8 
defining the problem, 5—8 
delimiters, 394-395 
dependent variables, 124, 423 
diagnosticity, 159-162 

disassemble (step 2 of analysis process), 9—12, 256-258 
distribution, Gaussian (normal), 270 
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distribution of chance error, 336 
distribution of data, 262 
diversity of outcomes, 318, 328—329 
dot (.), 406 

dot plots, 206 

(see also scatterplots) 
duplicate data, eliminating, 408-413 


E

edit() command in R, 264

equations 
linear, 304 

multiple regression, 423 
objective function, 81 

regression, 306, 308—310, 318, 321—326, 356 
slope, 305, 308 

error 

managing, through segmentation, 346—348 
quantitative, 332—338 
variable across graph, 344-345 
(see also chance error; Root Mean Squared [R.M.S.] 
error) 

error bands, 339-340, 352 
escape character (\), 406 
ethics 

and control groups, 59 

showing integrity toward clients, 131, 327 

evaluate (step 3 of analysis process), 13—14 
evidence 

diagnostic, 159-162 
in falsification method, 154 
handling new, 164-166, 217-223 
model/hypothesis fitting, 144—145 


Excel/OpenOffice
=AVG() formula, 121
Bayes’ rule in, 220
charting tools in, 129, 211, 260-262
Chart Output checkbox, 261
=CONCATENATE() formula, 398, 399, 403
Convert Text to Column Wizard, 394-395
=COUNTIF() formula, 368
Data Analysis, 260
=FIND() formula, 398, 403
histograms in, 260-262
Input Range field, 261
=LEFT() formula, 398, 399, 403
=LEN() formula, 398, 403
nested searches in, 403
no regular expressions in, 407
Paste Special function, 400
pivot tables in, 421
=RAND() formula, 68
Remove Duplicates button, 413
=RIGHT() formula, 398, 399, 403
Solver utility
Changing Cells field, 93
installing/activating, 431-433
Target Cell field, 92, 93
specifying a delimiter, 394
standard deviation in, 208, 210
=STDEV() formula, 208, 210
=SUBSTITUTE() formula, 398-402
=SUMIF() formula, 369-370
text formulas, 398-402
=TRIM() formula, 398
=VALUE() formula, 398

experiments
control groups, 58-59, 62-67
example process flowchart, 71
vs. observational study, 43, 54, 58-59
overview, 37
randomness and, 66-68
for strategy, 54, 62-65

exploratory data analysis, 7, 124, 421
extrapolation, 321-322, 326, 356
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F 

false negatives, 176—181 
false positives, 175-181 

falsification method of hypothesis testing, 152-155 

fast and frugal trees, 239, 242, 244 

feasible region, 85, 87 

=FIND() in Excel/OpenOffice, 398, 403

Fireside Chat (Bayes’ Rule and Gut Instinct), 218

flipping the theory, 45 

frequentist hypothesis testing, 155 

G 

Gadget (Google Docs), 425 
Galton, Sir Francis, 298 
gaps 

in histograms, 263 
knowledge, 25-27 

gaps in histograms, 263 
Gaussian (normal) distribution, 270 
Geek Bits 

regex specification, 404 
slope calculation, 308 

getwd() command in R, 371, 407

Google Docs, 425 

granularity, 9 

graphics (see visualizations) 
graph of averages, 297-298 
groupings of data, 258-266, 269—270, 274 

H 

head() command in R, 291-292, 372, 405
Head First Statistics, 155, 424
help() command in R, 267


heterogeneous data, 155 
heuristics 

and choice of variables, 240 
defined, 237 

fast and frugal tree, 239, 242, 244 
human reasoning as, 237-238 
vs. intuition, 236 
overview, 225, 235—236 
rules of thumb, 238, 244 
stereotypes as, 244 

strengths and weaknesses of, 238, 244 
hist() command in R, 265-266, 272
histograms 

in Excel/OpenOffice, 260—262 

fixing gaps in, 263 

fixing multiple humps in, 269—276 

groupings of data and, 258-266, 269—270, 274 

normal (bell curve) distribution in, 270 

overlays of, 288 

overview, 251 

in R, 265-268

vs. scatterplots, 292 

historical controls, 59, 66 

human reasoning as heuristic, 237-238 

hypothesis testing 

diagnosticity, 159-162 
does it fit evidence, 144-145 
falsification method, 152-155 
frequentist, 155 
generating hypotheses, 150 
overview, 139 
satisficing, 152 

weighing hypotheses, 158-159 



I

Illustrator (Adobe), 129
independent variables, 103, 124 
intercepts, 304, 307, 340 
internal variation, 50 
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interpolation, 321 
intuition vs. heuristics, 236 
inventory of observational data, 43 
iterative, defined, 393 



J

jitter() command in R, 372


K 

knowledge gaps, 25-27 



L

lattices (arrays) of scatterplots, 126, 291, 379-381

=LEFT() in Excel/OpenOffice, 398, 399, 403

=LEN() in Excel/OpenOffice, 398, 403

library() command in R, 379-380

linear association, 291-302 

linear equation, 304 

linearity, 149, 303 

linear model object (lm), 306, 338, 340 

linear programming, 100 

linked spreadsheets, 361, 366, 369—371, 374 

linked variables, 103, 146-148 

lm() command in R, 306-309, 338, 340, 353-354

M 

measuring effectiveness, 228-232, 242, 246 
mental models, 20-27, 150-151, 311
method of comparison, 42, 58 
Microsoft Excel (see Excel/OpenOffice) 


Microsoft Visual Basic for Applications (VBA), 407 

models 

fit of, 131 

impact of incorrect, 34, 97-98 
include what you don’t know in, 25—26 
making them explicit, 21, 27
making them testable, 27 
mental, 20—27, 150-151, 238, 311 
need to constantly adjust, 98, 109 
segmented, 352 
statistical, 22, 27, 238, 330 
with too many variables, 233—235 

multi-panel lattice visualizations, 291 

multiple constraints, 84-87 

multiple predictive models, 346 

multiple regression, 298, 338, 423 

multivariate data visualization, 123, 125-126, 129, 291

N

negatively linked variables, 103, 146—148 

networked causes, 148, 149 

nondiagnostic evidence, 160 

nonlinear and multiple regression, 298, 338, 423 

normal (Gaussian) distribution, 270 

null-alternative testing, 424 

null (baseline) hypothesis, 155 

O

objective function, 80-82, 92, 233
objectives, 81, 92, 99, 118-120, 233
“objectivity,” 14
observational studies, 43, 45, 59 
OpenOffice (see Excel/OpenOffice) 
operations research, 100 
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optimization 

and constraints, 79, 100, 103-105
vs. falsification, 155 
vs. heuristics, 236—238 
overview, 75 

solving problems of, 80-81, 85, 90 
using Solver utility for, 90—94, 106—107 

order() command in R, 409

outcomes, diversity of, 318, 328—329 

out-of-the-box implementation, 365 

overlays of histograms, 288 

P 

paired data, 146, 291 

perpetual, iterative framework, 109 

pipe character (|)
in Bayes’ rule, 176 
in R commands, 380 

pivot tables, 421 

plot() command in R, 291-292, 372
polynomial regression, 423 
positively linked variables, 146—148 

practice downloads (www.headfirstlabs.com/books/hfda/) 
bathing_friends_unlimited.xls, 90 
hfda_ch04_home_page1.csv, 121
hfda_ch07_data_transposed.xls, 209
hfda_ch07_new_probs.xls, 219
hfda_ch09_employees.csv, 255
hfda_ch10_employees.csv, 291, 338
hfda_ch12_articleHitsComments.csv, 379
hfda_ch12_articles.csv, 367
hfda_ch12_issues.csv, 367
hfda_ch12_sales.csv, 369
hfda_ch13_raw_data.csv, 386
hfda.R, 265 

historical_sales_data.xls, 101 


prediction 

balanced with explanation, 350 
and data analysis, 286 
deviations from, 329—330 
explaining limits of, 322, 326, 332—333, 335, 356 
outside the data range (extrapolation), 321—322, 326, 
356 

and regression equations, 310 
and scatterplots, 294-300 

prevalence, effect of, 174 
previsualizing, 390—393, 414 

prior probabilities (see base rates [prior probabilities]) 
probabilities 

Bayes’ rule and, 182—189 

calculating false positives, negatives, 171—176, 182 
common mistakes in, 172—176 
conditional, 176-177, 182
(see also subjective probabilities) 

probability histograms, 418 
product mixes, 83—89, 100 

Q

quantitative 

constraints, 100 
errors, 332—338 
linking of pairs, 146 
making goals and beliefs, 8 
relationships, 376 
relations in RDBMS, 376 
theory, 233, 303 

querying 

defined, 375 

linear model object in R, 340 
SQL, 379 

question mark (?) in R, 404 

R

R
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charting tools in, 129 
cloud function, 291 
command prompt, 264 
commands 
?, 404 

cor(), 301-302
edit(), 264
getwd(), 371, 407
head(), 291-292, 372, 405
help(), 267
hist(), 265-266, 272
jitter(), 372
library(), 379-380
lm(), 306-309, 338, 340, 353-354
order(), 409
plot(), 291-292, 372
read.csv(), 291
save.image(), 265
sd(), 268, 276
source(), 265
sub(), 405-406
summary(), 268, 276, 339
unique(), 413
write.csv(), 413
xyplot(), 379-380
community of users, 422 
defaults, 270 
described, 263 
dotchart function in, 211 
histograms in, 265-268 
installing and running, 264—265, 428—429 
pipe character (|) in, 380 
regular expression searches in, 404-408 
scatterplot arrays in, 126 

r (correlation coefficient), 300—303, 338 

=RAND() in Excel/OpenOffice, 68

randomized controlled experiments, 40, 66-68, 70, 73, 
113 

randomness, 68, 424 
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Randomness Exposed Interview, 68 
random surveys, 40-44, 50-52, 73, 228-234 
rationality, 238 
raw data 

disassembling, 9—10, 255-259 

evaluating, 28—32 

flowchart for cleaning, 414 

previsualize final data set, 390, 392—394 

using delimiter to split data, 394-395 

using Excel nested searches, 403 

using Excel text formulas, 398—402 

using R regular expression searches, 404-408 

using R to eliminate duplicates in, 408-413 

RDBMS (relational database management system), 
376-378, 382, 412, 421

read.csv() command in R, 291

Ready Bake Code 

calculate r in R, 301-302 
generate a scatterplot in R, 291—292 

recommendations (see reports to clients) 
regression 

balancing explanation and prediction in, 350 

and chance error, 335 

correlation coefficient (r) and, 302—303 

Data Analysis for Public Policy (Tufte), 420 

linear, 307-308, 338, 423 

linear correlation and, 299-305 

nonlinear regression, 298, 338, 423 

origin of name, 298 

overview, 279, 298 

polynomial, 423 

and R.M.S. error, 337 

and segmentation, 348, 352, 354 

regression equations, 306, 308-310, 318, 321—326, 356 

regression lines, 298, 308, 321, 337, 348 

regular expression searches, 404-408 

relational database management system (RDBMS), 
376-378, 382, 412, 421

relational databases, 359, 364-370 










replicability, 303 
reports to clients 

examples of, 16, 34, 96, 136, 246, 248, 356 
guidelines for writing, 14—16, 33, 310 
using graphics, 16, 31, 48, 72, 154, 310 

representative samples, 40, 322 
residual distribution, 336—337 
residuals (see chance error)
residual standard error (see Root Mean Squared [R.M.S.]
error)

=RIGHT() in Excel/OpenOffice, 398, 399, 403
rise, defined, 305 

Root Mean Squared (R.M.S.) error 

compared to standard deviation, 337 
defined, 336—337 
formula for, 338 

improving prediction with, 342, 354-356 
in R, 339-340, 354 
regression and, 338 

rules of thumb, 238, 244 
run, defined, 305 



S

sampling, 40, 322, 418
satisficing, 152 

save.image() command in R, 265
scant data, 142, 231—232 

scatterplots 
3D, 291 

creating from spreadsheets in R, 371—373 
drawing lines for prediction in, 294—297 
vs. histograms, 292 
lattices (arrays) of, 126, 291, 379-381 
magnet chart, 290 
overview, 123-124, 291 
regression equation and, 309 
regression lines in, 298-300 


sd() command in R, 268, 276
segmentation, 346-348, 352, 354 
segments, 266, 318, 343, 350, 353 
self-evaluations, 252 

sigma (see Root Mean Squared [R.M.S.] error) 

slope, 305-308, 340 

Solver utility, 90—94, 100, 431-433 

sorting, 209—210, 409-410 

source() command in R, 265

splitting data, 346-348, 352, 354 

spread of outcomes, 276 

spreadsheets 

charting tools, 129 
linked, 361, 366, 369-371, 374
provided by clients, 374 
(see also Excel/OpenOffice) 

SQL (Structured Query Language), 379 

standard deviation 

calculating the, 210, 268, 276 
defined, 208 

and R.M.S. error calculation, 338 
and standard units, 302, 337 
=STDEV(), 208

standard units, 302 

statistical models, 22, 27 

=STDEV() in Excel/OpenOffice, 208, 210

stereotypes as heuristics, 244 

strip, defined, 296 

Structured Query Language (SQL), 379 

sub() command in R, 405-406

subjective probabilities 
charting, 205-206 
defined, 198 

describing with error ranges, 335 
overcompensation in, 218 
overview, 191 
quantifying, 201 


residual standard error (see Root Mean Squared [R.M.S.] 
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subjective probabilities (continued) 

revising using Bayes’ rule, 217—223 
strengths and weaknesses of, 211 

subsets of data, 271—276, 288 
=SUBSTITUTE() in Excel/OpenOffice, 398-402
=SUMIF() in Excel/OpenOffice, 369-370
summary() command in R, 268, 276, 339
summary data, 9—10, 256, 259-262 
surprise information, 18, 212-213 
surveys, 40-44, 50-52, 73, 228-234 

T 

tag clouds, 114, 291 
Test Drive 

Using Excel for histograms, 260—261 
Using R to get R.M.S. error, 339—340 
Using Solver, 93—94 

tests of significance, 418 

theory (see mental models) 

thinking with data, 116 

tilde (~), 176

ToolPak (Excel), 431-433 

transformations, 423 

=TRIM() in Excel/OpenOffice, 398

troubleshooting 

activating Analysis ToolPak, 431—433 
Data Analysis button missing, 260, 431—433 
gaps in Excel/OpenOffice histograms, 262—263 
histogram not in chart format, 261 
read.csv() command in R, 291
Solver utility not on menu, 90, 431—433 

true negatives, 175—181 

true positives, 176—181 

Tufte, Edward, 129, 420

two variable comparisons, 291—292 


U

ultra-specified problems, 237
uncertainty, 25-27, 342
unique() command in R, 413
Up Close 

conditional probability notation, 176 

confounding, 64 

correlation, 302 

histograms, 263 

your data needs, 78 

your regular expression, 406 

V 

=VALUE() in Excel/OpenOffice, 398
variables 

decision, 79-80, 92, 233 

dependent, 124, 423 

independent, 103, 124

linked, 103, 146-148 

multiple, 84, 123-126, 129, 291, 359

two, 291-292 

variation, internal, 50 

vertical bar (|) 

in Bayes’ rule, 176 
in R commands, 380 

Visual Basic for Applications (VBA), 407 
visualizations 

Beautiful Evidence (Tufte), 420 
causal diagrams, 46, 48 
data art, 129 

examples of poor, 83, 114-115 
fast and frugal trees, 239, 242, 244 
making the right comparisons, 120—123 
multi-panel lattice, 291 
multivariate, 123, 125—126, 129, 291 
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overview, 111 

in reports, 16, 72, 96
software for, 129, 211
tag clouds, 114
(see also histograms; scatterplots)

W

Watch it!
always keep an eye on your model assumptions, 323
always make comparisons explicit, 42
does your regression make sense?, 306
way off on probabilities, 172, 184

websites
to download R, 264, 428
Edward Tufte, 420
Head First, 416

whole numbers, 182
wildcard search, 406
write.csv() command in R, 413

X

xyplot() command in R, 379-380

Y 

y-axis intercept, 304, 307, 340 
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